Abstract. Starting from a heuristic learning scheme for strategic N-person games, we derive
a new class of continuous-time learning dynamics which consist of a replicator-like
term adjusted by an entropic penalty that keeps players' strategies away from the bound- 
ary of the game's strategy space. These entropy-driven dynamics are equivalent to players 
taking an exponentially discounting aggregate of their on-going payoffs and then using a 
quantal response choice model to pick an action based on these performance scores. Owing 
to this inherent duality, these dynamics satisfy a variant of the folk theorem of evolutionary 
game theory and converge to (arbitrarily precise) quantal approximations of Nash equilib- 
ria in potential games. Motivated by applications to traffic engineering, we exploit this 
duality in order to design a discrete-time, payoff-based learning algorithm which retains 
these convergence properties and only requires players to observe their in-game payoffs: 
in fact, the algorithm retains its robustness in the presence of stochastic perturbations and 
observation errors, and does not require any synchronization between players. 



1. Introduction 

Owing to the computational complexity of Nash equilibria and related game-theoretic
solution concepts, algorithms and processes for learning in games have received consider- 
able attention over the last two decades. Such procedures can be divided into two broad 
categories, depending on whether they evolve in continuous or discrete time: the former 
class includes the numerous dynamics for learning and evolution (see e.g. Sandholm [29] 
for a recent survey), whereas the latter focuses on learning in infinitely iterated games, 
such as fictitious play and its variants - for an overview, see Fudenberg and Levine [12] 
and references therein. 

A key challenge in these endeavors is that it is often unreasonable to assume that play- 
ers are capable of monitoring the strategies of their opponents - or even of calculating the 
payoffs of actions that they did not play. As a result, much of the literature of learning 
in games revolves around payoff-based adaptive schemes which only require players to 
observe the stream of their in-game payoffs: for instance, in the framework of cumula- 
tive reinforcement learning, players use their observed payoff information to score their 
actions based on their estimated performance over time, and they then use a fixed decision 
model (such as logit choice) to determine their actions at the next instance of play. The 
convergence of such algorithms in 2-player games has been studied from a Q-learning
perspective by Leslie and Collins [19] and Tuyls et al. [34] whereas, more recently, Cominetti
et al. [10] and Bravo [9] took a moving-average approach for scoring actions in general
N-player games and studied the long-term behavior of the resulting dynamics. Interestingly,
in all these cases, when the learning process converges, it converges to a perturbed Nash 
equilibrium of the game - viz. a fixed point of a perturbed best-response correspondence 
(Fudenberg and Levine [12]). 

Stochastic processes of this kind are usually analyzed with the ODE method of stochastic
approximation which essentially compares the long-term behavior of the discrete-time
process to the corresponding mean-field dynamics in continuous time - see e.g. the surveys
by Benaïm [5] and Borkar [7]. Indeed, there are several sufficient conditions which
guarantee that a discrete-time process and its continuous counterpart both converge asymp- 
totically to the same sets, so continuous dynamics are usually derived as the limit of a 
discrete-time (and possibly random) process rooted in some adaptive learning scheme - cf. 
the aforementioned works by Leslie and Collins [19], Cominetti et al. [10], and Bravo [9]. 

Contrary to this approach, we proceed from the continuous to the discrete and develop 
two different learning processes from the same dynamical system (the actual algorithm 
depends crucially on whether we look at the evolution of the players' strategies or the 
performance scores of their actions). Accordingly, the first contribution of our paper is 
to derive a class of entropy-driven game dynamics which consist of a replicator-like term 
plus a barrier term that keeps players from approaching the boundary of the state space by 
imposing an entropic penalty to their payoffs - hence the dynamics' name. Interestingly,
these dynamics are equivalent to players scoring their actions by taking an exponentially 
discounted (and continuously updated) aggregate of their payoffs and then using a quantal 
choice model to pick an action (McKelvey and Palfrey [22]); as such, entropy-driven dy- 
namics constitute the strategy-space counterpart of the Q-learning dynamics of Leslie and 
Collins [19] - see also Tuyls et al. [34]. 

Another important feature of these dynamics is their temperature, a parameter which 
specifies the relative weight of the dynamics' entropic barrier term with respect to the 
game's payoffs - and also measures the weight that players attribute to past events, viz.
the discount factor of their payoff aggregation scheme. These considerations allow us to
derive a number of quite general results such as the dynamics' convergence to quantal 
response equilibria (QRE) in potential games and an extension of the well-known folk 
theorem of evolutionary game theory (Hofbauer and Sigmund [14]). In particular, we 
show that stability and convergence depend crucially on the temperature of the dynamics: 
at zero temperature, strict Nash equilibria are the only stable and attracting states of the 
dynamics, just as in the case of the replicator equation; for negative temperatures, all pure 
action profiles are attracting (but with vastly different basins of attraction), whereas, for low 
positive temperatures, only QRE that are close to strict equilibria remain asymptotically 
stable. 

The second important contribution of our paper concerns the practical implementation 
of entropy-driven game dynamics as a learning algorithm with the following desirable 
properties: 

(1) The learning process is payoff-based, fully distributed and stateless - players only 
need to observe their in-game payoffs and no knowledge of the game's structure
or of the algorithm's state is required. 

(2) Payoffs may be subject to stochastic perturbations and observation errors; in fact,
payoff observations need not even be up-to-date. 

(3) Updates need not be synchronized - there is no need for a global update timer used 
by all players. 

These properties are key for the design of robust, decentralized optimization protocols in 
network and traffic engineering, but they also pose significant obstacles to convergence. Be 
that as it may, the convergence and boundary-avoidance properties of the continuous-time 
dynamics allow us to show that players converge to arbitrarily precise quantal approxima- 
tions of strict Nash equilibria in potential games (Theorem 4.3 and Propositions 4.4, 4.5). 
Thus, thanks to the congestion characterization of such games (Monderer and Shapley
[25]), we obtain a powerful distributed optimization method for a wide class of engineer- 
ing problems, ranging from traffic routing (Altman et al. [1]) to wireless communications 
(Mertikopoulos et al. [23]). 

1.1. Paper outline and structure. After a few preliminaries in the rest of this section, 
our analysis proper begins in Section 2 where we introduce our cumulative reinforcement 
learning scheme and derive the associated entropy-driven game dynamics. Owing to the 
duality between the evolution of the players' mixed strategies and the performance scores 
of their actions (measured by an exponentially discounted aggregate of past payoffs), we 
obtain two equivalent formulations of the dynamics: the score-based integral equation 
(ERL) and the strategy-based dynamics (ED). 

In Section 3, we exploit this interplay to derive certain properties of the entropy-driven
game dynamics, namely their convergence to perturbed equilibria in potential games (Pro- 
position 3.4), and a variant of the folk theorem of evolutionary game theory (Theorem 3.7). 
Finally, Section 4 is devoted to the discretization of the dynamics (ERL) and (ED), yielding 
Algorithms 1 and 2 respectively. By using stochastic approximation techniques, we show 
that when the players' learning temperature is positive (corresponding to an exponential 
discount factor $\lambda < 1$), then the strategy-based algorithm converges almost surely to
perturbed strict equilibria in potential games (Theorem 4.3), even in the presence of noise
(Proposition 4.4) and/or update asynchronicities (Proposition 4.5). 

1.2. Notational conventions. If $S = \{s_\alpha\}_{\alpha=0}^{n}$ is a finite set, the vector space
spanned by $S$ over $\mathbb{R}$ will be the set $\mathbb{R}^{S}$ of all maps $x\colon S \to \mathbb{R}$,
$s \in S \mapsto x_s \in \mathbb{R}$. The canonical basis $\{e_s\}_{s \in S}$ of this space consists of the
indicator functions $e_s\colon S \to \mathbb{R}$ which take the value $e_s(s) = 1$ on $s$ and vanish otherwise,
so thanks to the natural identification $s \mapsto e_s$, we will make no distinction between $s \in S$
and the corresponding basis vector $e_s$ of $\mathbb{R}^{S}$. Likewise, to avoid drowning in a morass of
indices, we will frequently use the index $\alpha$ to refer interchangeably to either $s_\alpha$ or
$e_\alpha$ (writing e.g. $x_\alpha$ instead of the more unwieldy $x_{s_\alpha}$); in a similar vein, if
$\{S_k\}_{k \in K}$ is a finite family of finite sets indexed by $k \in K$, we will use the shorthands
$(\alpha; \alpha_{-k})$ for the tuple $(\alpha_0, \dots, \alpha_{k-1}, \alpha, \alpha_{k+1}, \dots) \in \prod_k S_k$
and $\sum_\alpha^k$ in place of $\sum_{\alpha \in S_k}$.

We will also identify the set $\Delta(S)$ of probability measures on $S$ with the unit
$n$-dimensional simplex $\Delta(S) = \{x \in \mathbb{R}^{S} : \sum_\alpha x_\alpha = 1 \text{ and } x_\alpha \ge 0\}$
of $\mathbb{R}^{S}$. Since $\Delta(S)$ is a smooth submanifold-with-corners of $\mathbb{R}^{S}$, by a smooth
function on $\Delta(S)$ we will mean a $C^\infty$ function in the smooth structure that $\Delta(S)$
inherits from $\mathbb{R}^{S}$ (Lee [18]). Moreover, if $S_0 = S \setminus \{s_0\}$, we will write
$\mathrm{proj}_0(x) = x_{-0} = (x_1, \dots, x_n)$ for the induced surjection
$x \in \mathbb{R}^{S} \mapsto x|_{S_0} \in \mathbb{R}^{S_0}$.

Regarding players and their actions, we will follow the original convention of Nash
and employ Latin indices ($k, \ell, \dots$) for players, while keeping Greek ones ($\alpha, \beta, \dots$)
for their actions (pure strategies); finally, unless otherwise mentioned, we will use
$\alpha, \beta, \dots$ for indices that start at 0, and $\mu, \nu, \dots$ for those which start at 1.

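1.3. Definitions from game theory. A finite game $\mathfrak{G} \equiv \mathfrak{G}(\mathcal{N}, \mathcal{A}, u)$ will be a
tuple consisting of a) a finite set of players $\mathcal{N} = \{1, \dots, N\}$; b) a finite set $\mathcal{A}_k$ of
actions (or pure strategies) for each player $k \in \mathcal{N}$; and c) the players' payoff functions
$u_k\colon \mathcal{A} \to \mathbb{R}$, where $\mathcal{A} \equiv \prod_k \mathcal{A}_k$ denotes the game's action space, i.e. the set
of all action profiles $(\alpha_1, \dots, \alpha_N)$, $\alpha_k \in \mathcal{A}_k$. More succinctly, if
$\mathcal{A}^* \equiv \coprod_k \mathcal{A}_k = \{(\alpha, k) : \alpha \in \mathcal{A}_k\}$ is the disjoint union of the players'
action sets, then the payoff map of $\mathfrak{G}$ will be the map
$u\colon \mathcal{A} \to \mathbb{R}^{\mathcal{A}^*} \cong \prod_k \mathbb{R}^{\mathcal{A}_k}$ which sends the profile
$(\alpha_1, \dots, \alpha_N) \in \mathcal{A}$ to the payoff vector
$(u_k(\alpha; \alpha_{-k}))_{\alpha \in \mathcal{A}_k, k \in \mathcal{N}} \in \prod_k \mathbb{R}^{\mathcal{A}_k}$. A restriction of
$\mathfrak{G}$ will then be a game $\mathfrak{G}' \equiv \mathfrak{G}'(\mathcal{N}, \mathcal{A}', u')$ with the same players as
$\mathfrak{G}$, each with a subset $\mathcal{A}'_k \subseteq \mathcal{A}_k$ of their original actions, and with payoff
functions $u'_k \equiv u_k|_{\mathcal{A}'}$ suitably restricted to the reduced action space
$\mathcal{A}' = \prod_k \mathcal{A}'_k$ of $\mathfrak{G}'$.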

Needless to say, players can mix their actions by taking probability distributions
$x_k = (x_{k\alpha})_{\alpha \in \mathcal{A}_k} \in \Delta(\mathcal{A}_k)$ over their action sets $\mathcal{A}_k$. In that case,
their expected payoffs will be

\[ u_k(x) = \sum_{\alpha_1 \in \mathcal{A}_1} \cdots \sum_{\alpha_N \in \mathcal{A}_N} u_k(\alpha_1, \dots, \alpha_N)\, x_{1,\alpha_1} \cdots x_{N,\alpha_N}, \tag{1.1} \]

where $x = (x_1, \dots, x_N)$ denotes the players' (mixed) strategy profile and
$u_k(\alpha_1, \dots, \alpha_N)$ is the payoff to player $k$ in the (pure) action profile
$(\alpha_1, \dots, \alpha_N) \in \mathcal{A}$; more explicitly, if player $k$ plays the pure strategy
$\alpha \in \mathcal{A}_k$, we will use the notation
$u_{k\alpha}(x) \equiv u_k(\alpha; x_{-k}) = u_k(x_1, \dots, \alpha, \dots, x_N)$. In this mixed context, the
strategy space of player $k$ will be the simplex $X_k \equiv \Delta(\mathcal{A}_k)$ while the strategy space
of the game will be the convex polytope $X \equiv \prod_k X_k$. Together with the players' (expected)
payoff functions $u_k\colon X \to \mathbb{R}$, the tuple $(\mathcal{N}, X, u)$ will be called the mixed extension
of $\mathfrak{G}$ and it will also be denoted by $\mathfrak{G}$ (relying on context to resolve any ambiguities).
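
As an illustration of the multilinear sum (1.1), the following minimal Python sketch evaluates the expected payoff of one player by enumerating all pure action profiles. The function name `expected_payoff`, the dictionary layout of the payoff table and the 2x2 coordination game are assumptions made for this example only; they do not come from the paper.

```python
from itertools import product

def expected_payoff(payoff, strategies):
    """Expected payoff u_k(x) of eq. (1.1) for a single player k.

    payoff:     dict mapping pure action profiles (tuples of action indices)
                to the payoff of player k in that profile
    strategies: one list of probabilities per player (the mixed profile x)
    """
    total = 0.0
    # Sum over every pure action profile (a_1, ..., a_N).
    for profile in product(*(range(len(x)) for x in strategies)):
        weight = 1.0
        for player, action in enumerate(profile):
            weight *= strategies[player][action]  # x_{1,a_1} * ... * x_{N,a_N}
        total += payoff[profile] * weight
    return total

# Hypothetical 2x2 coordination game: player 1 earns 1 when both actions match.
u1 = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}
x = [[0.5, 0.5], [0.25, 0.75]]

value = expected_payoff(u1, x)
# Payoff u_{k,alpha}(x) to the pure strategy alpha = 0 of player 1:
# replace x_1 by the corresponding vertex of the simplex.
value_pure = expected_payoff(u1, [[1.0, 0.0], x[1]])
```

Identifying a pure strategy with a vertex of the simplex, as in the last line, is exactly the convention that lets the same function return both $u_k(x)$ and $u_{k\alpha}(x)$.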

The most prominent solution concept in game theory is that of Nash equilibrium (NE)
which characterizes profiles that are resilient against unilateral deviations; formally,
$q \in X$ will be a Nash equilibrium of $\mathfrak{G}$ when

\[ u_k(x_k; q_{-k}) \le u_k(q) \quad \text{for all } x_k \in X_k \text{ and for all } k \in \mathcal{N}. \tag{NE} \]

In particular, if (NE) is strict for all $x_k \in X_k \setminus \{q_k\}$, $k \in \mathcal{N}$, then $q$ will be
called a strict Nash equilibrium.

An especially relevant class of finite games is obtained when the players' payoff
functions satisfy the potential property:

\[ u_{k\alpha}(x) - u_{k\beta}(x) = U(\alpha; x_{-k}) - U(\beta; x_{-k}) \tag{1.2} \]

for some (necessarily) multilinear function $U\colon X \to \mathbb{R}$. When this is the case, the game
will be called a potential game with potential function $U$, and as is well known, the pure
Nash equilibria of $\mathfrak{G}$ will be precisely the vertices of $X$ that are local maximizers of $U$
(Monderer and Shapley [25]).
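
To make the potential property concrete, here is a small sketch that checks (1.2) on all pure profiles of a two-player game (by multilinearity of both sides, this is enough for (1.2) to hold on all of $X$) and then compares the pure Nash equilibria with the vertices where the potential cannot be increased by a unilateral deviation. The Prisoner's Dilemma payoffs, the potential table and all function names are illustrative assumptions, not taken from the paper.

```python
from itertools import product

def deviate(profile, player, action):
    """Return the profile (action; profile_{-player})."""
    return profile[:player] + (action,) + profile[player + 1:]

def is_potential(payoffs, potential, sizes):
    """Check the potential property (1.2) on every pure profile and deviation."""
    for profile in product(*(range(n) for n in sizes)):
        for k, n in enumerate(sizes):
            for b in range(n):
                other = deviate(profile, k, b)
                du = payoffs[k][profile] - payoffs[k][other]
                dU = potential[profile] - potential[other]
                if abs(du - dU) > 1e-12:
                    return False
    return True

def pure_nash(payoffs, sizes):
    """Pure profiles from which no player gains by a unilateral deviation."""
    return [p for p in product(*(range(n) for n in sizes))
            if all(payoffs[k][deviate(p, k, b)] <= payoffs[k][p]
                   for k, n in enumerate(sizes) for b in range(n))]

# Hypothetical Prisoner's Dilemma (action 0 = cooperate, 1 = defect).
u1 = {(0, 0): 3, (0, 1): 0, (1, 0): 5, (1, 1): 1}
u2 = {(0, 0): 3, (0, 1): 5, (1, 0): 0, (1, 1): 1}
U = {(0, 0): 0, (0, 1): 2, (1, 0): 2, (1, 1): 3}
sizes = (2, 2)

equilibria = pure_nash([u1, u2], sizes)
# Treating U as a common payoff yields the vertices that are maximal
# with respect to unilateral deviations.
maximizers = pure_nash([U, U], sizes)
```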

2. Reinforcement learning and entropy-driven dynamics 

In this section, our aim will be to derive a class of continuous-time learning dynamics 
based on the following cumulative reinforcement premise: agents accumulate a long-term 
"performance score" for each of their actions and they then use a (smooth) choice function 
to map these scores to strategies and continue playing. More precisely: 

(1) The assessment phase (Section 2.2) will comprise the scheme with which players 
aggregate past payoff information in order to update their actions' performance 
scores. 

(2) The choice phase (Section 2.3) will then describe how these scores are used to 
select a mixed strategy. 

For simplicity, we will first derive the dynamics that correspond to this learning pro- 
cess in the case of a single player whose payoffs are determined at each instance of play 
by Nature — the case of several players involved in a finite game will be entirely similar. 



Recall that we will be using $\alpha$ for both elements $\alpha \in \mathcal{A}_k$ and basis vectors
$e_\alpha \in \Delta(\mathcal{A}_k)$, so there is no clash of notation between payoffs to pure and mixed
strategies.



ENTROPY-DRIVEN DYNAMICS AND ROBUST LEARNING PROCEDURES IN GAMES 






Furthermore, the passage from discrete to continuous time will be done here at a heuris- 
tic level and we will assume that players have perfect payoff information, that is: a) they 
are assumed able to observe or otherwise calculate the payoffs of all their actions; and 
b) unless mentioned otherwise, this payoff information will be assumed accurate and not 
subject to measurement errors or other exogenous perturbations. The precise interplay be- 
tween discrete and continuous time and the effect of imperfect information and stochastic 
fluctuations will be explored in detail in Section 4. 

2.1. The model. In view of the above, our single-player learning model will be as follows:
at time $t$, an agent makes a discrete choice from the elements of a finite set
$\mathcal{A} = \{\alpha_0, \alpha_1, \dots, \alpha_n\}$ (representing e.g. the routes of a traffic network,
different stock options, etc.). We will denote the payoff to $\alpha \in \mathcal{A}$ at time $t$ by
$u_\alpha(t)$, and the agent's assessment of his actions' performance up to instance $t$ will be
represented by the score variables $y_\alpha(t) \in \mathbb{R}$, $\alpha \in \mathcal{A}$. In this context, the
assessment phase will describe how $y_\alpha(t)$ is updated using the payoffs $u_\alpha(s)$, $s \le t$,
of all past instances of play, whereas the choice stage will specify the choice map
$Q\colon \mathbb{R}^{\mathcal{A}} \to X \equiv \Delta(\mathcal{A})$ which prescribes the agent's mixed strategy $x \in X$
given his assessment of each action $\alpha \in \mathcal{A}$ so far.

2.2. The assessment stage: memory and aggregation of past information. Assuming 
for the moment that the agent plays at discrete time intervals $s = 0, 1, \dots, t$, the class
of assessment schemes that we will consider is the familiar and widely used exponential
model of long-term performance evaluation

\[ y_\alpha(t) = \sum_{s=0}^{t} \lambda^{t-s} u_\alpha(s), \tag{2.1} \]

where $\lambda \in (0, +\infty)$ is the model's discounting parameter, $u_\alpha(t)$ is the sequence of
payoffs corresponding to $\alpha \in \mathcal{A}$, and we are assuming for the moment that the model is
initially unbiased, i.e. $y_\alpha(0) = 0$. Clearly:

(1) For $\lambda \in (0, 1)$ we get the standard exponential discounting model which assigns
exponentially more weight to recent observations.

(2) If $\lambda = 1$ we get the unweighted aggregation scheme $y_\alpha(t) = \sum_{s=0}^{t} u_\alpha(s)$
which has been examined in the context of learning by Rustichini [28], Hofbauer et al. [15],
Sorin [32], Mertikopoulos and Moustakas [24] and many others.

(3) For $\lambda > 1$, the scheme (2.1) assigns exponentially more weight to past instances;
as such, this case has attracted very little interest in the literature (after all, it seems
rather counter-intuitive to discount current events in favor of a possibly irrelevant
past). Nevertheless, we will see that the choice $\lambda > 1$ leads to some very surprising
advantages, so we will not exclude this parameter range from our analysis.

Now, if the agent plays at discrete time intervals $0, h, \dots, nh = t$ with time step $h > 0$,
the exponential model (2.1) should be replaced by the scale-invariant version:

\[ y_\alpha(t) = \sum_{s \in [0:t:h]} \lambda^{t-s} u_\alpha(s)\, h, \tag{2.1'} \]

where the factor $h$ has been included to make the sum (2.1') intensive in $h$ (the sum consists
of $O(1/h)$ terms that are $O(1)$ in $h$, so it would scale extensively with $h^{-1}$ if not scaled
by $h$), the notation $[0 : t : h]$ represents the index set $\{0, h, 2h, \dots, nh = t\}$, and we plead
guilty to a slight abuse of notation for not differentiating between $s$ and $s/h$ in the argument
of $u_\alpha$ (and between $n$ and $t = nh$ in the case of $y_\alpha$). Accordingly, the assessment scheme
(2.1') yields the recursive updating rule:

\[ y_\alpha(t) = u_\alpha(t)\, h + \lambda^{h} y_\alpha(t - h), \tag{2.2} \]

which in turn shows that the updating of (2.1') does not require the agent to have infinite
memory: scores are simply updated by adding the payoffs obtained over a period $h$ to the
scores of the previous period scaled by the discount (or reinforcement) factor $\lambda^{h}$.
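
The equivalence between the discounted sum (2.1') and the memoryless update (2.2) can be checked numerically. The sketch below is only illustrative: the function names and the sample payoff stream are assumptions, and the score is taken to be zero before the first observation.

```python
def score_direct(payoffs, lam, h):
    """Score (2.1'): sum over s in {0, h, ..., n*h} of lam**(t - s) * u(s) * h."""
    n = len(payoffs) - 1  # payoffs[i] is the payoff observed at time s = i*h
    return sum(lam ** ((n - i) * h) * u * h for i, u in enumerate(payoffs))

def score_recursive(payoffs, lam, h):
    """Same score via the update (2.2): y(t) = u(t)*h + lam**h * y(t - h)."""
    y = 0.0  # assumed score before the first observation
    for u in payoffs:
        y = u * h + lam ** h * y  # only the previous score is kept in memory
    return y

# Hypothetical payoff stream observed at times 0, h, 2h, 3h.
stream = [1.0, 2.0, 0.5, 3.0]
direct = score_direct(stream, lam=0.8, h=0.5)
recursive = score_recursive(stream, lam=0.8, h=0.5)
```

With `lam=1.0` and `h=1.0` the recursion reduces to a plain running sum, which is the unweighted aggregation case.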

In this way, letting $h \to 0^+$ (and assuming Lipschitz-continuous payoff processes $u_\alpha(t)$
for simplicity), we readily obtain the continuous-time model

\[ \dot{y}_\alpha(t) = u_\alpha(t) - T y_\alpha(t), \tag{2.3} \]

or, in integral form:

\[ y_\alpha(t) = y_\alpha(0) e^{-Tt} + \int_0^t e^{-T(t-s)} u_\alpha(s)\, ds, \tag{2.3'} \]

where

\[ T \equiv \log(1/\lambda) \tag{2.4} \]

represents the learning temperature of our performance assessment scheme (see the following
sections for a justification of this terminology) and the term $y_\alpha(0) e^{-Tt}$ reflects the
initial bias $y_\alpha(0)$ of the agent (initially taken equal to 0). In tune with our previous
discussion, the standard exponential discounting regime $\lambda \in (0, 1)$ will thus correspond to
positive temperatures $T > 0$, unweighted aggregation will be obtained for $T \to 0$, and
exponential reinforcing of past observations will be recovered for negative learning
temperatures $T < 0$.
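
As a sanity check on the passage to continuous time, the following sketch compares the discrete update (2.2) for a small step $h$ against the closed form of (2.3') in the special case of a constant payoff $u$, for which the integral evaluates to $u(1 - e^{-Tt})/T$. The constant-payoff assumption and the function names are choices made for this example only.

```python
import math

def discrete_score(u, lam, h, t):
    """Iterate the update (2.2) with a constant payoff u up to time t = n*h."""
    y = 0.0
    for _ in range(round(t / h) + 1):  # instances s = 0, h, ..., n*h
        y = u * h + lam ** h * y
    return y

def continuous_score(u, lam, t, y0=0.0):
    """Closed form of (2.3') for a constant payoff u, with T = log(1/lam) != 0."""
    T = math.log(1.0 / lam)  # learning temperature (2.4)
    return y0 * math.exp(-T * t) + u * (1.0 - math.exp(-T * t)) / T

coarse = discrete_score(1.0, lam=0.5, h=1e-1, t=2.0)
fine = discrete_score(1.0, lam=0.5, h=1e-3, t=2.0)
limit = continuous_score(1.0, lam=0.5, t=2.0)
```

Shrinking `h` brings the discrete score closer to the continuous one, in line with the heuristic limit taken above.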

Remark 1. In our context, the scheme (2.3) emerges quite simply as the differential form
of an exponentially discounted model for aggregating past payoffs. It is thus interesting
to note that Leslie and Collins [19] and Tuyls et al. [34] obtained the dynamics (2.3) for
$T = 1$ from a quite different viewpoint, namely as the continuous-time limit of the
Q-learning estimator

\[ y_\alpha(t+1) = y_\alpha(t) + \gamma(t+1) \left( u_\alpha(t) - y_\alpha(t) \right) \times \frac{\mathbb{1}(\alpha(t) = \alpha)}{\mathbb{P}(\alpha(t) = \alpha)}, \tag{2.5} \]

where $\mathbb{1}$ and $\mathbb{P}$ denote respectively the indicator and probability of having chosen
$\alpha$ at time $t$, and $\gamma(t)$ is an $(\ell^2 - \ell^1)$-summable series of time steps (see also
Fudenberg and Levine [12]). The exact interplay between (2.3) and (2.5) will be explored in
detail in Section 4; for now we simply note that (2.3) can be interpreted both as a model of
discounting past information and also as a moving Q-average.

While on this point, we should also highlight the relation between (2.5) and the moving
average estimator of Cominetti et al. [10] which omits the factor $\mathbb{P}(\alpha(t) = \alpha)$ (or the
similar estimator of Bravo [9] which has a state-dependent step size). As a result of this
difference, the mean-field dynamics of [10] are scaled by the player's mixed strategy
$x_\alpha(t) = \mathbb{P}(\alpha(t) = \alpha)$, leading to the adjusted dynamics
$\dot{y}_\alpha = x_\alpha(t) \left( u_\alpha(t) - y_\alpha(t) \right)$. Given this difference in form, there will
basically be no overlap between our results and those of Cominetti et al. [10], but we will
endeavor to draw analogies with their results wherever possible.
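
The role of the factor $\mathbb{1}/\mathbb{P}$ in the estimator (2.5) can be illustrated with a short simulation. The sketch below makes several simplifying assumptions that are not in the text: the agent's mixed strategy is held fixed instead of being updated through a choice map, payoffs are fixed means plus Gaussian noise, and the step size is $\gamma(t) = 1/(t+1)$ (square-summable but not summable). Function and parameter names are hypothetical.

```python
import random

def q_learning_scores(mean_payoffs, strategy, steps, noise=0.1, seed=0):
    """Run the estimator (2.5) under a fixed mixed strategy (a simplification)."""
    rng = random.Random(seed)
    actions = list(range(len(mean_payoffs)))
    y = [0.0] * len(mean_payoffs)
    for t in range(steps):
        a = rng.choices(actions, weights=strategy)[0]  # sampled action alpha(t)
        u = mean_payoffs[a] + rng.gauss(0.0, noise)    # only this payoff is observed
        gamma = 1.0 / (t + 2)                          # assumed step gamma(t + 1)
        # Indicator/probability weighting: only the played action is updated,
        # and the update is divided by the probability of having played it.
        y[a] += gamma * (u - y[a]) / strategy[a]
    return y

means = [1.0, 0.2, -0.5]
scores = q_learning_scores(means, strategy=[0.5, 0.3, 0.2], steps=20000)
```

Dividing by the choice probability compensates for rarely played actions being updated less often, so every score tracks its own mean payoff at the same average rate; dropping that factor gives the strategy-scaled drift $x_\alpha (u_\alpha - y_\alpha)$ mentioned above.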



Note that the difference/differential equations (2.2)/(2.3) imply that initial scores decay (or
grow) exponentially with time in the absence of external forcing, commensurately to the first
payoff observation $u_\alpha(0)$.

2.3. The choice stage: smoothed best responses and entropy functions. Having established
the way that an agent updates his assessment vector $y \in \mathbb{R}^{\mathcal{A}}$ over time, we now turn
to mapping these scores to mixed strategies $x \in X \equiv \Delta(\mathcal{A})$ in a smooth fashion. In the
theory of discrete choice, this boils down to a smooth best response problem so our aim will
be to give a brief overview of the related constructions suitably adapted to our purposes;
for a more comprehensive account, see McFadden [21], Anderson et al. [3], or Chapter 5
in Sandholm [29].

To motivate our approach, observe that a natural choice for the agent would be to always
pick the action $\alpha \in \mathcal{A}$ with the highest score; however, this "best response" approach
carries several problems: First, if two scores $y_\alpha$ and $y_\beta$ happen to be equal (e.g. if there
are payoff ties), then this mapping becomes a set-valued correspondence which requires a
tie-breaking rule to be resolved (and is theoretically quite cumbersome to boot). Additionally,
such a practice could lead to completely discontinuous trajectories of play in continuous
time - for instance, when the payoffs $u_\alpha(t)$ are driven by an additive white Gaussian noise
process, as is commonly the case in information-theoretic applications of game theory;
see e.g. Altman et al. [1]. Finally, since best responding generically results in picking
pure strategies, such a process precludes convergence of strategies to non-pure equilibria
in finite games.

In view of the above, a common alternative to the "best response"
$x = \arg\max_{\alpha \in \mathcal{A}} \{y_\alpha\}$ is to smooth things out using the Gibbs map
$G\colon \mathbb{R}^{\mathcal{A}} \to X$ defined as

\[ G_\alpha(y) = \frac{\exp(y_\alpha)}{\sum_\beta \exp(y_\beta)}, \quad \alpha \in \mathcal{A} \tag{2.6} \]

(see e.g. Cominetti et al. [10], Fudenberg and Levine [12], Hofbauer et al. [15], Leslie and
Collins [19], Marsili et al. [20], Mertikopoulos and Moustakas [24], Rustichini [28], Sorin
[32] and many others for uses of this choice map in game-theoretic learning). Indeed, it is
well-known that $G(y)$ is the unique solution of the (strictly concave) maximization problem

\[ \begin{aligned} &\text{maximize} && \textstyle\sum_\beta x_\beta y_\beta - g(x), \\ &\text{subject to} && x_\alpha \ge 0, \ \textstyle\sum_\beta x_\beta = 1, \end{aligned} \tag{2.7} \]

where the Boltzmann-Gibbs entropy $g(x) \equiv \sum_\beta x_\beta \log x_\beta$ acts as a control cost
adjustment to the agent's average score $\bar{y} = \sum_\beta x_\beta y_\beta$ (Fudenberg and Levine [12],
van Damme [35]). In this way, $G(y)$ can be viewed as a smoothed best response: if the control
cost is scaled down by some small $\varepsilon > 0$ (i.e. $g(x)$ is replaced by $\varepsilon g(x)$ in (2.7)),
then the resulting solution $x^{\varepsilon} = G(\varepsilon^{-1} y)$ of (2.7) represents a smooth approximation
to the best response correspondence $y \mapsto \arg\max_{\alpha \in \mathcal{A}} \{y_\alpha\}$ as $\varepsilon \to 0$.
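
A minimal Python sketch of the Gibbs map (2.6) with the control cost scaled by $\varepsilon$ is given below. The function names are illustrative; subtracting the largest score before exponentiating is a standard numerical safeguard that leaves the result unchanged, not something prescribed by the text.

```python
import math

def gibbs(y, eps=1.0):
    """Gibbs map G(y / eps): solution of (2.7) with control cost eps * g(x)."""
    m = max(y)  # shifting all scores by a constant does not change G
    weights = [math.exp((v - m) / eps) for v in y]
    total = sum(weights)
    return [w / total for w in weights]

def objective(x, y):
    """Objective of (2.7): average score minus the Boltzmann-Gibbs control cost."""
    return sum(xi * yi for xi, yi in zip(x, y)) - sum(xi * math.log(xi) for xi in x)

scores = [1.0, 2.0, 3.0]
smooth = gibbs(scores)            # interior point of the simplex
sharp = gibbs(scores, eps=0.01)   # nearly all mass on the best action
log_sum_exp = 3.0 + math.log(1.0 + math.exp(-1.0) + math.exp(-2.0))
```

At the maximizer the objective of (2.7) equals $\log \sum_\beta \exp(y_\beta)$, which gives a quick way to confirm that `gibbs` returns the optimum.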

Interestingly, the Gibbs map can also be seen as a special case of a quantal response
function in the sense of McKelvey and Palfrey [22] - or a perturbed best response in the
language of Hofbauer and Sandholm [13]. To wit, assume that the agents' scores are
subject to additive stochastic fluctuations of the form

\[ \tilde{y}_\alpha = y_\alpha + \xi_\alpha, \tag{2.8} \]

where the $\xi_\alpha$ are independently Gumbel-distributed random variables with zero mean and
scale parameter $\varepsilon > 0$ (amounting to a variance of $\varepsilon^2 \pi^2 / 6$). Then, the choice
probability $P_\alpha(y)$ of $\alpha \in \mathcal{A}$ (defined as the probability that $\alpha \in \mathcal{A}$ maximizes
the perturbed score variable $\tilde{y}_\alpha$) will simply be

\[ P_\alpha(y) \equiv \mathbb{P}\big( \tilde{y}_\alpha = \max\nolimits_\beta \tilde{y}_\beta \big) = G_\alpha(\varepsilon^{-1} y). \tag{2.9} \]

As a result, ordinary best responses are again recovered in the limit where the magnitude
of the perturbations (measured by the scale parameter $\varepsilon > 0$ of the Gumbel distribution)
approaches 0.

Due to the entrenched terminology for the logit choice model, many authors call (2.6) the
"logit" map. However, (2.6) actually describes the inverse logit (or logistic) distribution,
so, to avoid inconsistencies, we will refer to (2.6) by the name of its originator.
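
The identity (2.9) lends itself to a quick Monte Carlo check: perturb the scores with zero-mean Gumbel noise of scale $\varepsilon$, record which action attains the maximum, and compare the empirical frequencies with $G(\varepsilon^{-1} y)$. The sampling routine, the number of draws and the tolerance used below are assumptions of this sketch.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # mean of the standard Gumbel distribution

def gibbs(y, eps=1.0):
    """Gibbs map G(y / eps) of (2.6)."""
    m = max(y)
    weights = [math.exp((v - m) / eps) for v in y]
    total = sum(weights)
    return [w / total for w in weights]

def gumbel(rng, eps):
    """Zero-mean Gumbel sample with scale eps, by inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -eps * math.log(-math.log(u)) - eps * EULER_GAMMA

def choice_frequencies(y, eps, draws=20000, seed=0):
    """Empirical probability that each action maximizes the perturbed score (2.8)."""
    rng = random.Random(seed)
    counts = [0] * len(y)
    for _ in range(draws):
        perturbed = [v + gumbel(rng, eps) for v in y]
        counts[perturbed.index(max(perturbed))] += 1
    return [c / draws for c in counts]

scores = [1.0, 0.5, 0.0]
empirical = choice_frequencies(scores, eps=0.5)
predicted = gibbs(scores, eps=0.5)
```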

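More generally, assume that the perturbations are not Gumbel-distributed but instead
follow an arbitrary probability law with a strictly positive and smooth density function. In
this context, Hofbauer and Sandholm [13] showed that the resulting choice probabilities
$Q_\alpha(y) = \mathbb{P}(\tilde{y}_\alpha = \max_\beta \tilde{y}_\beta)$ solve the maximization problem

\[ \begin{aligned} &\text{maximize} && \textstyle\sum_\beta x_\beta y_\beta - h(x), \\ &\text{subject to} && x_\alpha \ge 0, \ \textstyle\sum_\beta x_\beta = 1, \end{aligned} \tag{EP} \]

where the deterministic representation $h$ of $\xi$ is a smooth, strictly convex function on
$X \equiv \Delta(\mathcal{A})$ whose Legendre-Fenchel conjugate (Rockafellar [27])

\[ h^*(y) = \max_{x \in X} \Big\{ \textstyle\sum_\beta x_\beta y_\beta - h(x) \Big\}, \quad y \in \mathbb{R}^{\mathcal{A}}, \tag{L-F} \]

satisfies the potential equation $Q_\alpha(y) = \partial h^* / \partial y_\alpha$.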

On account of the above, the choice map $Q$ can be viewed either as a quantal response
function to some perturbation process $\xi$, or as a smooth approximation to
$\arg\max_\beta y_\beta$ with respect to an admissible control cost adjustment $h$ (if we take the
strictly concave problem (EP) as our starting point). Formally, we have:

Definition 2.1. A function $h\colon X \to \mathbb{R} \cup \{+\infty\}$ will be called a generalized entropy
function when:

(1) $h$ is convex and finite almost everywhere, except possibly on the relative boundary
$\mathrm{bd}(X)$ of $X$.

(2) $h$ is smooth on $\mathrm{relint}(X)$ and $|dh(x)| \to \infty$ when $x$ converges to $\mathrm{bd}(X)$.

(3) The Hessian tensor $\mathrm{Hess}(h)$ of $h$ is positive-definite on $\mathrm{relint}(X)$.

The Legendre-Fenchel conjugate $h^*$ of $h$ as defined by (L-F) will be called the free entropy
of $h$, and the map $Q\colon \mathbb{R}^{\mathcal{A}} \to X$,
$y \mapsto Q(y) \equiv \arg\max_{x \in X} \{ \sum_\beta x_\beta y_\beta - h(x) \}$, will be the choice
map associated to $h$. Finally, a generalized entropy function $h\colon X \to \mathbb{R}$ will be called
regular when a) its restriction to any subface $X'$ of $X$ is itself an entropy function, and
b) the ratio $h'(q + vt) / h''(q + vt)$ vanishes as $q + vt$ approaches $\mathrm{bd}(X')$ for all interior
points $q \in \mathrm{relint}(X')$ and for all tangent vectors $v \in T_q X'$.

The fact that the choice map $Q$ of an entropy functional is well-defined and single-valued
is an easy consequence of the convexity and boundary behavior of $h$; the smoothness
of $Q$ then follows from the implicit function theorem (Rockafellar [27]). Thus, given that
(EP) allows us to view $Q(\varepsilon^{-1} y)$ as a smooth approximation to $\arg\max_\beta y_\beta$ for
$\varepsilon \to 0^+$, the class of choice maps that we will consider will be precisely the maps that
are derived from entropy functionals in the sense of Definition 2.1. A few remarks are thus
in order:



These two viewpoints are not equivalent because there exist cost functions $h$ that do not
arise as deterministic representations of perturbation processes $\xi$; see Hofbauer and
Sandholm [13] for a counterexample based on the log-entropy $h(x) = -\sum_\beta \log x_\beta$.

Remark 1. In statistical mechanics and information theory, entropy functions are concave,
so Definition 2.1 actually describes negative entropies. Besides notational convenience,
one of the main reasons for this change of sign is that entropy functions as defined above
are essentially smooth functions of Legendre type (Rockafellar [27]), with the added
non-degeneracy condition of having a strictly definite Hessian matrix. In fact, Definition 2.1
with our chosen sign convention bears very close ties to the class of Bregman functions,
a key tool in interior point and proximal methods in optimization; for a more detailed
account, see e.g. Rockafellar [27], Auslender et al. [4], Alvarez et al. [2] and references
therein.

Remark 2. Examples of entropy functionals abound; some of the most prominent ones are:

1. The Boltzmann-Gibbs entropy: $h(x) = \sum_\beta x_\beta \log x_\beta$. (2.10a)

2. The log-entropy: $h(x) = -\sum_\beta \log x_\beta$. (2.10b)

3. The Tsallis entropy: $h(x) = (1 - q)^{-1} \sum_\beta (x_\beta - x_\beta^{q})$, $0 < q \le 1$. (2.10c)

4. The Rényi entropy: $h(x) = (q - 1)^{-1} \log \sum_\beta x_\beta^{q}$, $0 < q \le 1$. (2.10d)

Except for the Rényi entropy, all of the above examples can be written in the convenient
form $h(x) = \sum_\beta \theta(x_\beta)$ for some function $\theta\colon [0, +\infty) \to \mathbb{R} \cup \{+\infty\}$ with
the properties:

(1) $\theta$ is finite and smooth everywhere except possibly at 0.

(2) $\theta'(x) \to -\infty$ as $x \to 0^+$ and $\theta''(x) > 0$ for all $x > 0$.

When $h$ can be decomposed in this way, we will follow Alvarez et al. [2] and say that $h$ is
decomposable with Legendre kernel $\theta$.
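
For readers who wish to experiment with these examples, the sketch below implements the four entropies (2.10a)-(2.10d) for interior points of the simplex and prepares a numerical comparison of the Tsallis and Rényi entropies with the Boltzmann-Gibbs entropy as $q \to 1$, together with a midpoint convexity check. Function names and test points are assumptions of this example.

```python
import math

def gibbs_entropy(x):
    """Boltzmann-Gibbs entropy (2.10a); x must be an interior point."""
    return sum(v * math.log(v) for v in x)

def log_entropy(x):
    """Log-entropy (2.10b)."""
    return -sum(math.log(v) for v in x)

def tsallis_entropy(x, q):
    """Tsallis entropy (2.10c) for 0 < q < 1."""
    return sum(v - v ** q for v in x) / (1.0 - q)

def renyi_entropy(x, q):
    """Renyi entropy (2.10d) for 0 < q < 1."""
    return math.log(sum(v ** q for v in x)) / (q - 1.0)

a = [0.2, 0.3, 0.5]
b = [0.6, 0.3, 0.1]
mid = [(u + v) / 2 for u, v in zip(a, b)]

def midpoint_convex(h):
    """Check h((a+b)/2) <= (h(a)+h(b))/2 on the two sample points."""
    return h(mid) <= 0.5 * (h(a) + h(b))

entropies = [
    gibbs_entropy,
    log_entropy,
    lambda x: tsallis_entropy(x, 0.5),
    lambda x: renyi_entropy(x, 0.5),
]
all_convex = all(midpoint_convex(h) for h in entropies)
```

Evaluating `tsallis_entropy(a, 0.999)` and `renyi_entropy(a, 0.999)` gives values close to `gibbs_entropy(a)`, which is why the Boltzmann-Gibbs form is the natural definition at $q = 1$.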

Remark 3. The regularity requirement of Definition 2.1 is just a safety net to ensure that $h$
behaves well with respect to restrictions to subfaces of $X$ and does not exhibit any
pathologies near $\mathrm{bd}(X)$. Of the examples (2.10) only the log-entropy (2.10b) is not regular
because it is identically equal to $+\infty$ on every proper subface of $X$.

The Tsallis and Rényi entropies are not well-defined for $q = 1$, but they both approach the
standard Gibbs entropy as $q \to 1$, so we will use the definition (2.10a) for $q = 1$ in
(2.10c)-(2.10d).

Remark 4. Another technical point that underlies Definition 2.1 is that we are implicitly
assuming that $h$ is defined on an open neighborhood of $X$ in $\mathbb{R}^{\mathcal{A}}$ so that the derivatives
of $h$ are well-defined. The reason that we are not making this assumption explicit is that it
may be done away with as follows: Let $\mathcal{A}_0 = \mathcal{A} \setminus \{\alpha_0\}$ and consider the canonical
projection $\mathrm{proj}_0\colon \mathbb{R}^{\mathcal{A}} \to \mathbb{R}^{\mathcal{A}_0}$ defined in components as
$\mathrm{proj}_0(x_0, x_1, \dots, x_n) = x_{-0} = (x_1, \dots, x_n)$. Then, the image
$X_0 = \{w \in \mathbb{R}^{\mathcal{A}_0} : w_\mu \ge 0 \text{ and } \sum_\mu w_\mu \le 1\}$ of $X$ under
$\mathrm{proj}_0$ will be homeomorphic to $X$ and the inverse to $\mathrm{proj}_0$ on $X_0$ will be the
injective immersion $\iota_0\colon \mathbb{R}^{\mathcal{A}_0} \to \mathbb{R}^{\mathcal{A}}$ with
$\iota_0(w_1, \dots, w_n) = (1 - \sum_\mu w_\mu, w_1, \dots, w_n)$. In view of the above, the
directional derivatives of $h$ on $X$ may be defined by means of the pullback
$h_0 = \iota_0^* h \equiv h \circ \iota_0$ as the partial derivatives $\partial h_0 / \partial w_\mu$
(and similarly for the Hessian of $h$).

As one would expect, if $h$ is defined on an open neighborhood of $X$, we will have
$\partial h_0 / \partial w_\mu = \partial h / \partial x_\mu - \partial h / \partial x_0$, so the above discussion
reduces to treating $x_0 = 1 - \sum_\mu w_\mu$ as a dependent variable. Conversely, any smooth
function $h\colon X \to \mathbb{R}$ can be extended smoothly to all of $\mathbb{R}^{\mathcal{A}}$ (e.g. via
mollification), in which case it is easy to see that the directional derivatives
$\partial h_0 / \partial w_\mu$ are independent of the extension and the equality
$\partial h_0 / \partial w_\mu = \partial h / \partial x_\mu - \partial h / \partial x_0$ still holds on $X$.
Consequently, we lose no generality in assuming that $h$ is in fact defined on an
open neighborhood of $X$ in $\mathbb{R}^{\mathcal{A}}$, and we will do so throughout the rest of the paper
unless explicitly stated otherwise. However, the "reduced" coordinates $w = \mathrm{proj}_0(x)$ and
the associated derivations will be very important in our calculations, so their introduction
above is not just a technical triviality.
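
The chain-rule identity between derivatives in the reduced coordinates $w$ and in the original coordinates $x$ can be verified by finite differences. The sketch below does so for the Boltzmann-Gibbs entropy, for which $\partial h / \partial x_\beta = \log x_\beta + 1$; the choice of entropy, the step size and the function names are assumptions made for illustration.

```python
import math

def h(x):
    """Boltzmann-Gibbs entropy (2.10a), used here as a test function."""
    return sum(v * math.log(v) for v in x)

def h0(w):
    """Pullback h0 = h o iota0, with x0 = 1 - sum(w) as a dependent variable."""
    return h([1.0 - sum(w)] + list(w))

def d_h0(w, mu, step=1e-6):
    """Central finite difference of h0 along the reduced coordinate w[mu]."""
    up = list(w)
    up[mu] += step
    down = list(w)
    down[mu] -= step
    return (h0(up) - h0(down)) / (2.0 * step)

def d_h(x, beta):
    """Exact partial derivative of the Gibbs entropy in the coordinate x[beta]."""
    return math.log(x[beta]) + 1.0

w = [0.3, 0.5]              # reduced coordinates (x_1, x_2)
x = [1.0 - sum(w)] + w      # full point (x_0, x_1, x_2) on the simplex
# w[mu] corresponds to x[mu + 1], since x[0] is the dependent coordinate.
gaps = [d_h0(w, mu) - (d_h(x, mu + 1) - d_h(x, 0)) for mu in range(len(w))]
```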

2.4. Entropy-driven learning dynamics. Combining the results of the previous two sections
on how to assess the long-term performance of an action and how to translate these
assessments into strategies, we obtain the general class of entropy-driven learning
processes:

\[ \begin{aligned} y(t) &= y(0) e^{-Tt} + \int_0^t e^{-T(t-s)} u(s)\, ds, \\ x(t) &= Q(y(t)), \end{aligned} \tag{ERL} \]

where $Q$ is the choice map of the driving entropy $h\colon X \to \mathbb{R}$ and $T$ is the model's
learning temperature.
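
A minimal simulation of the two-step process (ERL) is sketched below for a single agent, under assumptions chosen purely for illustration: payoffs are constant in time, the driving entropy is Boltzmann-Gibbs (so that $Q$ is the Gibbs map), and the score equation is integrated in its differential form (2.3) with an explicit Euler scheme. For constant payoffs and $T > 0$, the scores should settle at $u/T$ and the strategy at $G(u/T)$.

```python
import math

def gibbs(y):
    """Gibbs map (2.6): the choice map of the Boltzmann-Gibbs entropy."""
    m = max(y)
    weights = [math.exp(v - m) for v in y]
    total = sum(weights)
    return [w / total for w in weights]

def simulate_erl(u, T, horizon, dt=0.01, y0=None):
    """Euler integration of dy/dt = u - T*y, followed by the choice step x = Q(y)."""
    y = list(y0) if y0 is not None else [0.0] * len(u)
    for _ in range(round(horizon / dt)):
        y = [yi + dt * (ui - T * yi) for yi, ui in zip(y, u)]  # assessment step
    return y, gibbs(y)                                          # choice step

payoffs = [1.0, 0.0, 0.5]  # hypothetical constant payoffs
y_final, x_final = simulate_erl(payoffs, T=0.5, horizon=40.0)
x_limit = gibbs([p / 0.5 for p in payoffs])
```

Lower temperatures push the limiting scores $u/T$ further apart, so the limiting strategy concentrates more sharply on the best action.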

From an implementation perspective, the difficulty with (ERL) is twofold: First, for
a given entropy function $h$, it is not always practical to write the choice function $Q$ in a
closed-form expression that the agent can use to update his strategies. Furthermore, even
when this is possible, (ERL) is a two-step computationally intensive process which does
not allow the agent to update his strategies directly. The rest of this section will thus be
devoted to writing (ERL) as a continuous-time dynamical system on $X$ that can be updated
with minimal computation overhead.

To that end, it will be convenient to work with a modified set of variables which measure
the long-term difference in performance between an action $\mu \in \mathcal{A}$ and a "flagged"
benchmark action $\alpha_0 \in \mathcal{A}$. Formally, letting $\mathcal{A}_0 = \mathcal{A} \setminus \{\alpha_0\}$ as in Remark 4 above, the
relative score of an action $\mu \in \mathcal{A}_0$ will be the difference

\[ z_\mu = y_\mu - y_0, \tag{2.11} \]

or, more concisely, $z = \pi_0(y)$ where $\pi_0\colon \mathbb{R}^{\mathcal{A}} \to \mathbb{R}^{\mathcal{A}_0}$ is the submersion

\[ \pi_0(y_0, y_1, \dots, y_n) = (y_1 - y_0, \dots, y_n - y_0) = (z_1, \dots, z_n). \tag{2.12} \]

Thereby, the evolution of $z$ over time will be

\[ \dot z_\mu = \dot y_\mu - \dot y_0 = \Delta u_\mu - T z_\mu, \tag{ZD} \]

where now $\Delta u_\mu$ denotes the associated payoff difference $\Delta u_\mu = u_\mu - u_0$, $\mu \in \mathcal{A}_0$.
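
The step leading to (ZD) can be spelled out in one line. The following is only a sketch: it assumes that each score obeys the exponential-discounting relation $\dot y_\alpha = u_\alpha - T y_\alpha$, which is the differential form of the assessment stage of (ERL).

```latex
\begin{align*}
\dot z_\mu &= \dot y_\mu - \dot y_0
            = (u_\mu - T y_\mu) - (u_0 - T y_0) \\
           &= (u_\mu - u_0) - T\,(y_\mu - y_0)
            = \Delta u_\mu - T z_\mu .
\end{align*}
```

In particular, the benchmark score $y_0$ drops out of the relative dynamics, which is what allows the learning process to be tracked through $z$ alone.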

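The main advantage of introducing the variables $z$ is that even though the choice map
$Q\colon \mathbb{R}^{\mathcal{A}} \to X$ is not injective (and thus does not admit an inverse),^8 there exists a smooth
embedding $Q_0\colon \mathbb{R}^{\mathcal{A}_0} \to X_0 = \mathrm{proj}_0(X) = \{x \in \mathbb{R}^{\mathcal{A}_0} : x_\mu \geq 0, \sum_\mu x_\mu \leq 1\}$ such that the
following diagram commutes:

\[
\begin{array}{ccc}
\mathbb{R}^{\mathcal{A}} & \xrightarrow{\;Q\;} & X \\
{\scriptstyle \pi_0} \downarrow & & \downarrow {\scriptstyle \mathrm{proj}_0} \\
\mathbb{R}^{\mathcal{A}_0} & \xrightarrow{\;Q_0\;} & X_0
\end{array}
\tag{2.13}
\]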

^7 The case of the Boltzmann-Gibbs entropy is a shining (but, ultimately, misleading) exception to the norm.
^8 In fact, $Q$ is constant along $(1, \dots, 1)$: adding $c \in \mathbb{R}$ to every component of $y \in \mathbb{R}^{\mathcal{A}}$ will not change the
solution $x = Q(y)$ of (EP).
With such an embedding at our disposal, we will then be able to translate the dynamics
of $z \in \mathbb{R}^{\mathcal{A}_0}$ to $X_0$, and thence to $X$ via the inverse $\iota_0$ of $\mathrm{proj}_0$ on $X$: $\iota_0(w_1, \dots, w_n) =
(1 - \sum_\mu w_\mu, w_1, \dots, w_n)$.

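To construct $Q_0$ itself, note that (EP) may be rewritten in terms of $z_\mu = y_\mu - y_0$ as:

\[
\begin{aligned}
&\text{maximize} && \textstyle\sum_\mu w_\mu z_\mu - h_0(w_1, \dots, w_n), \\
&\text{subject to} && w_\mu \geq 0, \quad \textstyle\sum_\mu w_\mu \leq 1,
\end{aligned}
\tag{EP$_0$}
\]

where $h_0 = \iota_0^{*} h = h \circ \iota_0$, i.e. $h_0(w_1, \dots, w_n) = h(1 - \sum_\mu w_\mu, w_1, \dots, w_n)$. Similarly to (EP),
the (unique) solution of (EP$_0$) will lie in the interior $\mathrm{int}(X_0)$ of $X_0$, so we will have $x = Q(y)$
iff

\[ z_\mu = \frac{\partial h_0}{\partial w_\mu} = \frac{\partial h}{\partial x_\mu} - \frac{\partial h}{\partial x_0}, \tag{2.14} \]

i.e. iff $z = F_0(w)$ where $F_0\colon \mathrm{int}(X_0) \to \mathbb{R}^{\mathcal{A}_0}$ denotes the gradient $F_0(w) = \nabla h_0(w)$,
$w \in \mathrm{int}(X_0)$. As it turns out, the required embedding $Q_0$ is simply the inverse of $F_0$: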

Lemma 2.2. Let $h\colon X \to \mathbb{R}$ be a generalized entropy function. Then, the map $F_0 =
\nabla h_0\colon \mathrm{int}(X_0) \to \mathbb{R}^{\mathcal{A}_0}$ defined above is a diffeomorphism whose inverse $Q_0 = F_0^{-1}$ makes
the diagram (2.13) commute.

Proof. The fact that $F_0$ is a continuous bijection with continuous inverse follows
from the general theory of Legendre-type functions (see e.g. Theorem 26.5 in Rockafellar
[27]); the diagram (2.13) then commutes on account of the equivalence $x = Q(y) \iff \pi_0(y) =
F_0(\mathrm{proj}_0(x))$. Finally, to show that $F_0$ is indeed a diffeomorphism, note that the Jacobian
of $F_0$ is just the Hessian of $h_0$, and with $\mathrm{Hess}(h)$ strictly positive-definite by assumption,
our claim follows from the inverse function theorem (see e.g. Lee [18]). □
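
For the Boltzmann-Gibbs entropy $h(x) = \sum_\alpha x_\alpha \log x_\alpha$, the maps of Lemma 2.2 have closed forms, so the statement can be checked numerically. The sketch below is only an illustration: the function names (`gibbs_choice`, `pi0`, `proj0`, `F0`, `Q0`) and the test vector `y` are hypothetical choices made here, with action 0 playing the role of the flagged benchmark.

```python
import numpy as np

def gibbs_choice(y):
    """Gibbs (logit) choice map Q: R^A -> X for h(x) = sum_a x_a log x_a."""
    e = np.exp(y - y.max())          # shift by max(y) for numerical stability
    return e / e.sum()

def pi0(y):
    """Relative scores z_mu = y_mu - y_0 (action 0 is the flagged benchmark)."""
    return y[1:] - y[0]

def proj0(x):
    """Reduced coordinates w = (x_1, ..., x_n), dropping x_0."""
    return x[1:]

def F0(w):
    """Gradient of h_0(w) = h(1 - sum(w), w) for the Gibbs entropy."""
    return np.log(w) - np.log(1.0 - w.sum())

def Q0(z):
    """Candidate inverse of F0: w_mu = exp(z_mu) / (1 + sum_nu exp(z_nu))."""
    e = np.exp(z)
    return e / (1.0 + e.sum())

y = np.array([0.3, -1.2, 0.8, 2.0])   # arbitrary score vector (illustrative)
z = pi0(y)

inverse_ok = np.allclose(F0(Q0(z)), z)                      # Q0 inverts F0
diagram_ok = np.allclose(proj0(gibbs_choice(y)), Q0(z))     # (2.13) commutes
shift_ok = np.allclose(gibbs_choice(y + 5.0), gibbs_choice(y))  # Q not injective
```

The last check illustrates why $Q$ itself has no inverse: shifting every score by the same constant leaves the chosen strategy unchanged, whereas $Q_0$ acts on score differences only.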

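Having established a diffeomorphism between the variables $z_\mu$ and $w_\mu$, let $h_{\mu\nu}$ denote
the elements of the corresponding Jacobian matrix $JF_0 = \mathrm{Hess}(h_0)$, i.e.

\[
h_{\mu\nu} = \frac{\partial z_\mu}{\partial w_\nu} = \frac{\partial^2 h_0}{\partial w_\mu \partial w_\nu}
= \frac{\partial^2 h}{\partial x_\mu \partial x_\nu} + \frac{\partial^2 h}{\partial x_0^2}
- \frac{\partial^2 h}{\partial x_0 \partial x_\mu} - \frac{\partial^2 h}{\partial x_0 \partial x_\nu}.
\tag{2.15}
\]

Then, letting $h^{\mu\nu}$ denote the inverse of $h_{\mu\nu}$, and combining the learning scheme
(ERL) with the evolution equation (ZD), we obtain the (unilateral) entropy-driven learning
dynamics:

\[
\dot x_\mu = \dot w_\mu = \sum_\nu h^{\mu\nu} \dot z_\nu = \sum_\nu h^{\mu\nu} (\Delta u_\nu - T z_\nu), \tag{2.16}
\]

where, as before, $\Delta u_\nu = u_\nu - u_0$ and $z_\nu = \frac{\partial h}{\partial x_\nu} - \frac{\partial h}{\partial x_0}$.^9 Therefore, if the agent's payoffs
are coming from a finite game $\mathfrak{G} = \mathfrak{G}(\mathcal{N}, \mathcal{A}, u)$, our previous discussion yields the class of
entropy-driven game dynamics

\[
\dot x_{k\mu} = \sum_\nu h_k^{\mu\nu}(x) \big( u_{k\nu}(x) - u_{k,0}(x) \big)
- T \sum_\nu h_k^{\mu\nu}(x) \left( \frac{\partial h_k}{\partial x_{k\nu}} - \frac{\partial h_k}{\partial x_{k,0}} \right),
\tag{ED}
\]

where now $h_k\colon X_k \to \mathbb{R}$ is the entropy function of player $k$ (generating the corresponding
player-specific choice map $Q_k\colon \mathbb{R}^{\mathcal{A}_k} \to X_k$) and $h_k^{\mu\nu}$ is the inverse Hessian matrix of $h_k$
defined as in (2.15).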

These dynamics will be the main focus of the rest of our paper, so some remarks are in 
order: 



^9 Note that $\dot x_0 = -\sum_\mu \dot x_\mu$, so the action $\alpha_0 \in \mathcal{A}$ is not being discriminated against.
Remark 1. To get some trivial book-keeping out of the way (and to keep our notation as
light as possible), note that the player-specific entropy functions $h_k$ may be encoded in
the aggregate entropy $h\colon X \to \mathbb{R}$ with $h(x_1, \dots, x_N) = \sum_k h_k(x_k)$, $x_k \in X_k$. Likewise, if
we set $X = \prod_k X_k$ and replace $\mathcal{A}$ with $\mathcal{A}^{*} = \coprod_k \mathcal{A}_k$ in Definition 2.1, then the player-specific
choice maps $Q_k\colon \mathbb{R}^{\mathcal{A}_k} \to X_k$ may themselves be encoded in the composite choice
map $Q\colon \mathbb{R}^{\mathcal{A}^{*}} = \prod_k \mathbb{R}^{\mathcal{A}_k} \to X$ associated to $h$. Therefore, whenever we mention entropy
functions and choice maps in the context of a game (and not simply in a discrete choice
problem), it should be understood that we are referring to the above construction.

Remark 2. It is also important to note that the dynamics (ED) admit global solutions,
i.e. solutions that remain in $\mathrm{int}(X_0)$ for all $t \geq 0$. This can be proven directly using
the differential system (ED), but the score representation (2.3) probably yields a more
transparent view: since the payoff streams $u_\alpha(t)$ are Lipschitz and bounded,^10 the scores
$y_\alpha(t)$ will remain finite for all $t \geq 0$, so interior solutions $x(t) = Q(y(t))$ of (ED) will be
defined for all $t \geq 0$ themselves.^11

Remark 3. The previous remark brings up an important distinction between interior and 
non-interior orbits: strictly speaking, (ED) is only defined on $\mathrm{int}(X_0)$, so boundary initial
conditions must be handled with more care. To address initial conditions $x(0) \in X$ with
arbitrary support $\mathcal{A}' = \mathrm{supp}(x(0)) \subseteq \mathcal{A}$, it will be convenient to assume that the entropy
$h$ is regular; in that case, by restricting (EP) to the subface $X' = \Delta(\mathcal{A}')$ of $X$, we obtain
a similarly restricted choice map $Q'\colon \mathbb{R}^{\mathcal{A}'} \to X'$ and the agent may proceed by updating
the scores of the supported actions $\alpha \in \mathcal{A}'$ in (ERL). In this way, every subface $X'$ of
X becomes an invariant manifold of (ERL)/(ED), so entropy-driven dynamics are seen to 
belong to the general class of imitative dynamics introduced by Bjornerstedt and Weibull 
[6] (see also Weibull [36]). 

Remark 4. In addition to tuning their learning temperature $T > 0$,^12 players can also try to
sharpen their response model by replacing the choice stage of (ERL) with

\[ x_k = Q_k(\eta_k y_k) \tag{2.17} \]

for some $\eta_k > 0$. As can be seen from (EP), these choice parameters may then be viewed
as (player-specific) inverse choice temperatures: as $\eta_k \to \infty$, the choices of player $k$ freeze
down to a "best response" to the stimulus $y$, whereas for $\eta_k \to 0$, player $k$ mixes actions
uniformly, without regard to their performance scores. On the other hand, the same
reasoning that led to (ED) also yields the choice-adjusted dynamics

\[
\dot x_{k\mu} = \eta_k \sum_\nu h_k^{\mu\nu}(x) \big( u_{k\nu}(x) - u_{k,0}(x) \big)
- \eta_k T \sum_\nu h_k^{\mu\nu}(x) \left( \frac{\partial h_k}{\partial x_{k\nu}} - \frac{\partial h_k}{\partial x_{k,0}} \right).
\tag{ED$_\eta$}
\]

We thus see that the inverse temperature $\eta_k$ of the player's choice model and the temperature
$T$ of their learning model (2.3) play very different roles in the resulting learning
dynamics (ED$_\eta$). The learning temperature $T$ affects only the entropic correction term of
(ED$_\eta$) whereas $\eta_k$ affects all terms of (ED$_\eta$) commensurately; in fact, $\eta_k$ can also be seen
as a player-specific change of time, an observation which will be crucial in considering
players with different update schedules in Section 4.



^10 Importantly, this property remains true in the case of several agents involved in a finite game.
^11 Interestingly, for $T = 0$, this can be seen as an alternative proof of Theorem 4.1 of Alvarez et al. [2] on the
existence of global solutions in Hessian-Riemannian gradient descent dynamics.
^12 In fact, we could also consider the case of player-specific learning temperatures $T_k$, but we will not do so
for simplicity.
We will close this section by noting that even though the dynamics (ED) are fully updateable
on $X$ (as opposed to the otherwise equivalent process (ERL) which interweaves $X$
and $\mathbb{R}^{\mathcal{A}}$), the RHS of (ED) still contains a non-explicit step in the calculation of the inverse
matrix $h^{\mu\nu}$. Given that the computational complexity of inverting a matrix is polynomial in
the size of the matrix, this does not pose much of a problem in practical applications (after
all, it is the number of players that usually explodes, not the number of actions per player).

That said, (ED) can be simplified considerably if the entropy function which is driving
the process is itself decomposable. Indeed, assume that $h(x) = \sum_\beta \theta(x_\beta)$ for some non-degenerate
Legendre kernel $\theta\colon [0, +\infty) \to \mathbb{R} \cup \{+\infty\}$ as in Remark 2 following Definition
2.1. In that case, (2.15) readily yields

\[
h_{\mu\nu} = \frac{\partial^2 h}{\partial x_\mu \partial x_\nu} + \frac{\partial^2 h}{\partial x_0^2}
- \frac{\partial^2 h}{\partial x_0 \partial x_\mu} - \frac{\partial^2 h}{\partial x_0 \partial x_\nu}
= \theta''(x_\mu)\, \delta_{\mu\nu} + \theta''(x_0),
\tag{2.18}
\]

so $h^{\mu\nu}$ can be calculated from the following lemma:

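Lemma 2.3. Let $A_{\mu\nu} = q_\mu \delta_{\mu\nu} + q_0$ with $q_0, q_1, \dots, q_n > 0$. Then, the inverse $A^{\mu\nu}$ of $A_{\mu\nu}$ will
be:

\[ A^{\mu\nu} = \frac{\delta_{\mu\nu}}{q_\mu} - \frac{Q_h}{q_\mu q_\nu}, \tag{2.19} \]

where $Q_h$ is the harmonic aggregate $Q_h^{-1} = \sum_{\alpha=0}^{n} 1/q_\alpha$.

Proof. By simple inspection, we have:

\[
\begin{aligned}
\sum_\nu A_{\mu\nu} A^{\nu\rho}
&= \sum_\nu (q_\mu \delta_{\mu\nu} + q_0) \big( \delta_{\nu\rho}/q_\nu - Q_h/(q_\nu q_\rho) \big) \\
&= \sum_\nu \big[ q_\mu \delta_{\mu\nu} \delta_{\nu\rho}/q_\nu + q_0 \delta_{\nu\rho}/q_\nu - q_\mu Q_h \delta_{\mu\nu}/(q_\nu q_\rho) - q_0 Q_h/(q_\nu q_\rho) \big] \\
&= \delta_{\mu\rho} + q_0 q_\rho^{-1} - Q_h q_\rho^{-1} - q_0 Q_h q_\rho^{-1} \sum_\nu q_\nu^{-1} = \delta_{\mu\rho}.
\end{aligned}
\]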
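
The inversion formula (2.19) is easy to sanity-check numerically. The snippet below is a minimal sketch: the helper name `lemma_inverse`, the matrix size and the randomly drawn values of $q_0, q_1, \dots, q_n$ are illustrative assumptions, not part of the original argument.

```python
import numpy as np

def lemma_inverse(q0, q):
    """Closed-form inverse of A = diag(q) + q0 * ones, following (2.19)."""
    # Harmonic aggregate: 1/Q_h = sum over alpha = 0..n of 1/q_alpha.
    Qh = 1.0 / (1.0 / q0 + np.sum(1.0 / q))
    return np.diag(1.0 / q) - Qh / np.outer(q, q)

rng = np.random.default_rng(0)
n = 5
q0 = 0.7                                  # illustrative value of q_0 > 0
q = rng.uniform(0.5, 3.0, size=n)         # illustrative q_1, ..., q_n > 0
A = np.diag(q) + q0 * np.ones((n, n))     # A_{mu nu} = q_mu delta_{mu nu} + q_0

A_inv = lemma_inverse(q0, q)
err = np.abs(A @ A_inv - np.eye(n)).max() # should vanish up to rounding
```

Since the closed form only needs the reciprocals $1/q_\alpha$, it costs $O(n^2)$ operations to fill in the inverse, as opposed to the cubic cost of a generic matrix inversion.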



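In view of the above inversion formula applied to (2.18), and setting $z_{k\mu} = \frac{\partial h}{\partial x_{k\mu}} - \frac{\partial h}{\partial x_{k,0}} =
\theta'(x_{k\mu}) - \theta'(x_{k,0})$ in (ED), some algebra finally yields the entropic dynamics with kernel $\theta$:

\[
\dot x_{k\alpha} = \frac{1}{\theta''(x_{k\alpha})} \left[ u_{k\alpha}(x) - \Theta_h''(x_k) \sum_\beta \frac{u_{k\beta}(x)}{\theta''(x_{k\beta})} \right]
- \frac{T}{\theta''(x_{k\alpha})} \left[ \theta'(x_{k\alpha}) - \Theta_h''(x_k) \sum_\beta \frac{\theta'(x_{k\beta})}{\theta''(x_{k\beta})} \right],
\tag{ED$_\theta$}
\]

where $\Theta_h''$ denotes the harmonic aggregate $\Theta_h''(x_k) = \big( \sum_\beta 1/\theta''(x_{k\beta}) \big)^{-1}$.^13 Hence, in the important
case of the Boltzmann-Gibbs kernel $\theta(x) = x \log x$, we readily obtain the temperature-adjusted
replicator equation

\[
\dot x_{k\alpha} = x_{k\alpha} \left[ u_{k\alpha}(x) - \sum_\beta x_{k\beta} u_{k\beta}(x) \right]
- T x_{k\alpha} \left[ \log x_{k\alpha} - \sum_\beta x_{k\beta} \log x_{k\beta} \right],
\tag{T-RD}
\]

which, for $T = 0$, freezes to the ordinary (asymmetric) replicator dynamics of Taylor and
Jonker [33]:

\[
\dot x_{k\alpha} = x_{k\alpha} \left[ u_{k\alpha}(x) - \sum_\beta x_{k\beta} u_{k\beta}(x) \right].
\tag{RD}
\]

^13 Needless to say, $\Theta_h''$ is not a second derivative; we just use this notation for visual consistency.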
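
The reduction of the kernel dynamics (ED$_\theta$) to the temperature-adjusted replicator equation (T-RD) can be verified numerically for a single player. The code below is a sketch under stated assumptions: the function names, the strategy `x`, the payoff vector `u` and the temperature `T` are arbitrary illustrative choices, and the kernel derivatives are passed in as plain callables.

```python
import numpy as np

def ed_theta_rhs(x, u, T, d1, d2):
    """Right-hand side of (ED_theta) for one player.

    x  : interior mixed strategy of the player
    u  : payoff vector of the player's actions at the current profile
    d1 : callable returning theta'(x) componentwise
    d2 : callable returning theta''(x) componentwise
    """
    inv = 1.0 / d2(x)                 # 1 / theta''(x_beta)
    Theta = 1.0 / inv.sum()           # harmonic aggregate Theta_h''
    g = u - T * d1(x)                 # payoff minus entropic correction
    return inv * (g - Theta * np.sum(inv * g))

def trd_rhs(x, u, T):
    """Right-hand side of (T-RD)."""
    logx = np.log(x)
    return x * (u - x @ u) - T * x * (logx - x @ logx)

# Boltzmann-Gibbs kernel theta(x) = x log x and its derivatives.
def gibbs_d1(x):
    return np.log(x) + 1.0

def gibbs_d2(x):
    return 1.0 / x

x = np.array([0.2, 0.5, 0.3])         # illustrative interior strategy
u = np.array([1.0, -0.5, 2.0])        # illustrative payoff vector
T = 0.4                               # illustrative temperature

v_theta = ed_theta_rhs(x, u, T, gibbs_d1, gibbs_d2)
v_trd = trd_rhs(x, u, T)
gap = np.abs(v_theta - v_trd).max()
```

Both vector fields sum to zero over the player's actions, so trajectories remain on the simplex; other kernels can be tried by swapping in different `d1` and `d2`.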
Figure 1. Phase portraits of the temperature-adjusted replicator dynamics (T-RD) in
a 2 × 2 potential game (Nash equilibria are depicted in dark red and interior rest points in
light/dark blue; see labels for the game's payoffs). For high learning temperatures $T \gg 0$,
the dynamics cannot keep track of payoffs and their only rest point is a global attractor
which approaches the barycenter of $X$ as $T \to +\infty$ (corresponding to a QRE under stochastic
perturbations of very high magnitude). As the temperature drops to around $T \approx 0.935$,
this attractor becomes unstable and undergoes a supercritical pitchfork bifurcation (a phase
transition) resulting in the appearance of two asymptotically stable QRE that converge to
the strict Nash equilibria of the game as $T \to 0^{+}$. For negative temperatures, the non-equilibrium
vertices of $X$ become asymptotically stable (but with a very small basin of attraction),
and each of them gives birth to an unstable equilibrium in a subcritical pitchfork
bifurcation. Of these two equilibria, the one closer to the game's interior Nash equilibrium
is annihilated with the pre-existing QRE at $T \approx -0.278$, and as $T \to -\infty$, we obtain a
time-inverted image of the $T \to +\infty$ portrait with the only remaining QRE repelling all
trajectories towards the vertices of $X$.
3. Entropy-driven game dynamics and rationality

In this section, our aim will be to analyze the entropy-driven dynamics (ED) from the 
point of view of rational agents, mostly looking to determine their asymptotic convergence 
properties with respect to standard game-theoretic solution concepts. Thus, in conjunction 
with the notion of Nash equilibrium, we will also focus on the widely studied concept of a 
quantal response equilibrium:

Definition 3.1 (McKelvey and Palfrey, 1995). Let $\mathfrak{G} = \mathfrak{G}(\mathcal{N}, \mathcal{A}, u)$ be a finite game in
normal form and let $Q\colon \mathbb{R}^{\mathcal{A}} \to X$ be a regular choice function. We will say that $q \in X$ is a
quantal response equilibrium (QRE) of $\mathfrak{G}$ with respect to $Q$ (or a $Q$-equilibrium for short)
when, for some $\rho \geq 0$,

\[ q = Q(\rho\, u(q)), \tag{QRE} \]

where $u(q) \in \mathbb{R}^{\mathcal{A}}$ denotes the payoff vector of the profile $q \in X$. More generally, if
$\mathfrak{G}' = \mathfrak{G}(\mathcal{N}, \mathcal{A}', u|_{\mathcal{A}'})$ is a restriction of $\mathfrak{G}$, we will say that $q \in X$ is a restricted QRE of $\mathfrak{G}$
if it is a QRE of $\mathfrak{G}'$.

The scale parameter $\rho \geq 0$ will be called the rationality level of the QRE in question.
Obviously, when $\rho = 0$, players choose actions uniformly, without any regard to their
payoffs; at the other end of the spectrum, when $\rho \to \infty$, players become fully rational and
the notion of a QRE approximates smoothly that of a Nash equilibrium. Finally, one could
also consider negative rationality levels, in which case players become anti-rational: for
$\rho < 0$, the condition $x = Q(\rho\, u(x))$ characterizes the QRE of the opposite game $-\mathfrak{G} =
(\mathcal{N}, \mathcal{A}, -u)$, and as $\rho \to -\infty$, these equilibria approximate the Nash equilibria of $-\mathfrak{G}$.

To make this approximation idea more precise, let $q^{*} \in X$ be a Nash equilibrium of a
finite game $\mathfrak{G}$ and let $\gamma\colon U \to X$ be a smooth curve on $X$ defined on a half-infinite interval
of the form $U = [a, +\infty)$, $a \in \mathbb{R}$. We will then say that $\gamma$ is a $Q$-path to $q^{*}$ when $\gamma(\rho)$ is
a $Q$-equilibrium of $\mathfrak{G}$ with rationality level $\rho$ and $\lim_{\rho \to \infty} \gamma(\rho) = q^{*}$; in a similar vein, we
will say that $q \in X$ is a $Q$-approximation of $q^{*}$ when $q$ is itself a $Q$-equilibrium and there
is a $Q$-path joining $q$ to $q^{*}$ (van Damme [35] uses the terminology approachable).

Example 1. By far the most widely used specification of a QRE is the logit equilibrium
which corresponds to the Gibbs choice map (2.6): in particular, $q \in X$ will be a logit
equilibrium of $\mathfrak{G}$ when $q_{k\alpha} = \exp(\rho\, u_{k\alpha}(q)) / \sum_\beta \exp(\rho\, u_{k\beta}(q))$ for all $\alpha \in \mathcal{A}_k$, $k \in \mathcal{N}$.

Our first result links the rest points of (ED) at temperature T with the game's restricted 
QRE: 

Proposition 3.2. Let $\mathfrak{G} = \mathfrak{G}(\mathcal{N}, \mathcal{A}, u)$ be a finite game and let $h\colon X \to \mathbb{R}$ be a regular
entropy function with choice map $Q\colon \mathbb{R}^{\mathcal{A}} \to X$. Then:

(1) For $T > 0$, the rest points of the entropy-driven dynamics (ED) coincide with the
restricted QRE of $\mathfrak{G}$ with rationality level $\rho = 1/T$.

(2) For $T = 0$, the rest points of (ED) are the restricted Nash equilibria of $\mathfrak{G}$.

(3) For $T < 0$, the rest points of (ED) are the restricted QRE of the opposite game
$-\mathfrak{G}$.

Proof. Since our focus is on restricted equilibria, it clearly suffices to prove the
above correspondences for interior rest points; since the faces of $X$ are forward-invariant
under the dynamics (ED), the general claim then follows from passing to an appropriate
restriction $\mathfrak{G}'$ of $\mathfrak{G}$.

To that end, let $T > 0$ and note that Eq. (ZD) on the evolution of the relative scores
$z_{k\mu}$ allows us to characterize the rest points of (ED) by means of the equation $\Delta u_{k\mu}(x) -
T z_{k\mu} = 0$; equivalently, in terms of the absolute scores $y_{k\alpha}$, we will have $u_{k\alpha}(x) - T y_{k\alpha} =
u_{k\beta}(x) - T y_{k\beta}$ for all $\alpha, \beta \in \mathcal{A}_k$, $k \in \mathcal{N}$. However, with $x = Q(y)$, we will also have
$y_{k\alpha} - y_{k\beta} = \frac{\partial h}{\partial x_{k\alpha}} - \frac{\partial h}{\partial x_{k\beta}}$, so we readily obtain $u_{k\alpha}(x) - T \frac{\partial h}{\partial x_{k\alpha}} = u_{k\beta}(x) - T \frac{\partial h}{\partial x_{k\beta}}$; in turn, this
shows that $x = Q(T^{-1} u(x))$, so $x$ is a $Q$-equilibrium of $\mathfrak{G}$ with rationality level $\rho = 1/T$.

For $T = 0$ the result is immediate because (ZD) shows that $x \in X$ is an interior rest
point of (ED) if and only if $u_{k\alpha}(x) = u_{k\beta}(x)$ for all $\alpha, \beta \in \mathcal{A}_k$, $k \in \mathcal{N}$, i.e. if and only if $x$
is an interior Nash equilibrium of $\mathfrak{G}$ (recall that the Hessian matrix of $h$ is non-singular).
Finally, the time inversion $t \mapsto -t$ in (ED) is equivalent to the inversion $u \mapsto -u$, $T \mapsto -T$,
so our claim for negative $T$ follows from the case $T > 0$. □
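
Part (1) of the proposition can be illustrated numerically in the logit case, where the rest points of the temperature-adjusted replicator dynamics should satisfy the logit condition with $\rho = 1/T$. The sketch below rests on several assumptions made purely for illustration: the 2 × 2 common-payoff game `A`, the temperature `T = 1`, the initial conditions, and the use of a plain forward-Euler scheme with step `dt` (the paper itself works in continuous time).

```python
import numpy as np

def softmax(v):
    """Gibbs choice map (logit response) applied to a score vector."""
    e = np.exp(v - v.max())
    return e / e.sum()

def trd_rhs(x, u, T):
    """Temperature-adjusted replicator field for one player."""
    g = u - T * np.log(x)             # payoffs corrected by the entropic term
    return x * (g - x @ g)

# Hypothetical common-payoff (hence potential) game: both players earn x1' A x2.
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
T = 1.0                               # learning temperature, so rho = 1/T = 1
dt = 0.01                             # Euler step (illustrative)

x1 = np.array([0.9, 0.1])
x2 = np.array([0.2, 0.8])
for _ in range(20000):
    u1 = A @ x2                       # payoffs of player 1's pure actions
    u2 = A.T @ x1                     # payoffs of player 2's pure actions
    x1, x2 = x1 + dt * trd_rhs(x1, u1, T), x2 + dt * trd_rhs(x2, u2, T)

# Distance between the limit point and its own logit response at rho = 1/T.
res1 = np.abs(x1 - softmax(A @ x2 / T)).max()
res2 = np.abs(x2 - softmax(A.T @ x1 / T)).max()
```

Both residuals should be negligible once the iteration has settled, i.e. the limit point is a fixed point of the logit response map, as the proposition predicts for interior rest points.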

Remark 1. Proposition 3.2 shows that the temperature T of the dynamics (ED) plays a 
double role: on the one hand, it determines the discount (or reinforcement) factor in the 
players' assessment phase (2.3), so it reflects the importance that they give to past observa- 
tions; on the other hand, T also determines the rationality level of the rest points of (ED), 
measuring how far the stationary points of the players' learning process are from being 
Nash. Perhaps unsurprisingly, this dual role of the temperature is brought to the forefront 
by the probabilistic/perturbed interpretation of quantal responses as choice probabilities 
in the case of stochastically perturbed payoffs. Indeed, recalling the relevant discussion
of Section 2.3, we see that a QRE with rationality level $\rho = T^{-1}$ corresponds to best responding
in the presence of a noise process with standard deviation $s \propto \rho^{-1} = T$. On that
account, the players' learning temperature simply measures the inherent variance (inverse
rationality) of a QRE, just as the physical notion of temperature measures the variance of 
the random motions of the particles that make up a thermodynamic system (e.g. an ideal 
gas following Maxwell-Boltzmann statistics). 

Of course, stationarity does not capture the long-term behavior of a dynamical system, 
so the rest of our analysis will be focused on the convergence properties of (ED). To that 
end, we begin with the special case of potential games where the players' payoff functions 
are aligned along a common potential function as in Eq. (1.2). In this setting, our first 
result is that for small temperatures, the game's potential function is "almost" increasing 
along the solution orbits of (ED): 

Lemma 3.3. Let $\mathfrak{G} = \mathfrak{G}(\mathcal{N}, \mathcal{A}, u)$ be a finite potential game with potential $U$, and let
$h\colon X \to \mathbb{R}$ be a generalized entropy function. Then, the function $F(x) = T h(x) - U(x)$
is Lyapunov for the entropy-driven dynamics (ED): for any interior orbit $x(t)$ of (ED), we
will have $\frac{d}{dt} F(x(t)) \leq 0$ with equality if and only if $x(0)$ is a QRE of $\mathfrak{G}$ (or of $-\mathfrak{G}$ for $T < 0$).

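Proof. By expressing $F$ in the reduced coordinates $w_{k\mu} = x_{k\mu}$ that we used in the
derivation of (ED) (see also Remark 4 following Definition 2.1), we readily obtain:

\[
\frac{dF}{dt} = \sum_k \sum_\mu \frac{\partial F}{\partial w_{k\mu}} \dot w_{k\mu}
= - \sum_k \sum_{\mu,\nu} h_k^{\mu\nu} \left( \Delta u_{k\mu} - T z_{k\mu} \right) \left( \Delta u_{k\nu} - T z_{k\nu} \right),
\tag{3.1}
\]

with the second equality following from the fact that $\mathfrak{G}$ is a potential game, so $\frac{\partial U}{\partial w_{k\mu}} = \Delta u_{k\mu}$
by (1.2). Thus, with $\mathrm{Hess}(h) \succ 0$, (3.1) implies that $\dot F(x) \leq 0$ with equality if and only if
$\Delta u_{k\mu}(x) = T z_{k\mu} = T \big( \frac{\partial h}{\partial x_{k\mu}} - \frac{\partial h}{\partial x_{k,0}} \big)$, i.e. if and only if $x$ is a QRE of $\mathfrak{G}$ (or of $-\mathfrak{G}$ for $T < 0$; cf.
the proof of Proposition 3.2). □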
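
The Lyapunov property can be observed along a discretized orbit of the temperature-adjusted replicator dynamics. The following sketch is illustrative only: the common-payoff game `A` (whose potential is taken to be $U(x) = x_1^{\top} A x_2$), the temperature, the initial strategies and the forward-Euler step are all assumptions introduced here, and a small tolerance is needed because the discretization and floating-point rounding are not covered by the lemma.

```python
import numpy as np

def trd_rhs(x, u, T):
    """Temperature-adjusted replicator field for one player."""
    g = u - T * np.log(x)
    return x * (g - x @ g)

def free_energy(x1, x2, A, T):
    """F(x) = T h(x) - U(x) with Gibbs entropy h and potential U = x1' A x2."""
    h = np.sum(x1 * np.log(x1)) + np.sum(x2 * np.log(x2))
    return T * h - x1 @ A @ x2

# Hypothetical common-payoff game, so that U(x) = x1' A x2 is a potential.
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
T, dt = 0.5, 0.01                     # illustrative temperature and Euler step
x1 = np.array([0.3, 0.7])
x2 = np.array([0.6, 0.4])

F = [free_energy(x1, x2, A, T)]
for _ in range(2000):
    u1, u2 = A @ x2, A.T @ x1
    x1, x2 = x1 + dt * trd_rhs(x1, u1, T), x2 + dt * trd_rhs(x2, u2, T)
    F.append(free_energy(x1, x2, A, T))

# Largest one-step increase of F along the orbit (should not be positive,
# up to rounding).
max_increase = np.max(np.diff(F))
```

In this run $F$ drops strictly from its initial value and then flattens out as the orbit approaches a rest point, which matches the picture of the dynamics descending towards minimizers of $F$.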

When the players' entropy function is regular, Lemma 3.3 can be easily extended to
orbits lying on any subface $X'$ of $X$ simply by considering the restricted QRE of the game
that are supported in $X'$. Even in that case however, Lemma 3.3 makes no distinction
between positive and negative temperatures and simply shows that the dynamics (ED) will
tend to move towards the minimizers of $F$ on $X'$. What changes with the sign of $T$ is the
relation that these minimizers have with regard to restricted QRE: for $T < 0$, the only
local minimizers of $F$ are the vertices of $X$ (which are themselves pure restricted QRE),^14
whereas for $T > 0$, the restricted QRE of $\mathfrak{G}$ that are supported in a subface $X'$ of $X$ coincide
with the local minimizers of $F|_{X'}$. Formally, Lemma 3.3 and the above reasoning give:

Proposition 3.4. Let $h\colon X \to \mathbb{R}$ be a regular entropy function and let $x(t)$ be a solution
orbit of the associated entropic dynamics (ED) for a potential game $\mathfrak{G}$. Then:

(1) For $T > 0$, $x(t)$ converges to a restricted QRE of $\mathfrak{G}$ with the same support as $x(0)$.

(2) For $T = 0$, $x(t)$ converges to a restricted Nash equilibrium whose support is contained
in that of $x(0)$.

(3) For $T < 0$, $x(t)$ converges to a vertex of $X$ or is stationary (if $x(0)$ is a restricted
QRE of $-\mathfrak{G}$).

The above proposition will be our main result for potential games, so some remarks are 
in order: 

Remark 1. By continuity, the phase portrait of the entropy-driven dynamics for small temperatures
(positive or negative) will be broadly similar to the base case $T = 0$ (at least, in
the generic case where there are no payoff ties in $\mathfrak{G}$). Accordingly, the main difference between
positive and negative temperatures is that for small $T < 0$ the dynamics converge to
a bona fide (pure) Nash equilibrium for most initial conditions (except for a small basin of
attraction around each vertex of $X$ which pulls the dynamics to non-equilibrium vertices),
whereas for small $T > 0$, interior solutions of (ED) always converge to a $Q$-approximation
of a Nash equilibrium (see also Fig. 1). As we shall see in Section 4, the fact that the
entropy-driven dynamics always converge to the vicinity of a Nash equilibrium for small
$T > 0$ will be crucial in the presence of imperfect payoff information and/or stochastic
fluctuations.

Remark 2. If we interpret the game's potential $U$ as the internal energy of a thermodynamic
system, then, modulo a change of sign, the Lyapunov function $F(x) = T h(x) - U(x)$ is
known in statistical physics as the (Helmholtz) free energy and measures the useful work
that can be obtained from a thermodynamic system at constant temperature (Landau and
Lifshitz [17]).^15 In this context, the principle of energy minimization states that the free
energy of an isolated system never increases, so Lemma 3.3 may be viewed as a corollary 
of the Second Law of Thermodynamics: under constant temperature, the free energy of 
the system decreases until it reaches a thermal equilibrium. 

The previous discussion establishes a fundamental qualitative difference between posi- 
tive and negative learning temperatures in potential games: for T < 0, every vertex of X is 
attracting in (ED), while for T > 0, the dynamics can only converge to interior points. As 
we show below, this behavior actually applies to any finite game: 

Proposition 3.5. Let $h\colon X \to \mathbb{R}$ be a regular entropy function. Then:

(1) For negative temperatures $T < 0$, every vertex $q \in X$ is attracting in the entropic
dynamics (ED).

^14 Perhaps the easiest way to see this is to note that $F$ is subharmonic: $\frac{\partial^2 U}{\partial x_{k\alpha}^2} = 0$ for all $\alpha \in \mathcal{A}_k$, $k \in \mathcal{N}$, on
account of $U$ being multilinear, and $\frac{\partial^2 h}{\partial x_{k\alpha}^2} > 0$ on account of $h$ having a positive-definite Hessian.

^15 Recall that our sign convention for the entropy is the opposite of physics and probability; furthermore,
potentials are minimized in physics, so $h$ and $U$ should be replaced by $-h$ and $-U$ respectively, yielding the
familiar expression $F = U - Th$ (Landau and Lifshitz [17]).






PIERRE COUCHENEY, BRUNO GAUJAL, AND PANAYOTIS MERTIKOPOULOS 



(2) For positive temperatures $T > 0$, any $\omega$-limit point of an interior solution orbit
$x(t)$ is itself interior; in fact, the $\omega$-limit set of $\mathrm{int}(X)$ and the boundary $\mathrm{bd}(X)$ of $X$
are separated by neighborhoods.

Proof. Our proof will be based on the dynamics (ZD) for the relative scores $z_{k\mu}$,
$\mu \in \mathcal{A}_{k,0} = \mathcal{A}_k \setminus \{\alpha_{k,0}\}$. In integral form, we have:

\[ z_{k\mu}(t) = z_{k\mu}(0)\, e^{-Tt} + \int_0^t e^{-T(t-s)} \Delta u_{k\mu}(x(s)) \, ds, \tag{3.2} \]

so, with $\Delta u_{k\mu}$ bounded on $X$, the last integral will be bounded in absolute value by $\frac{M_k}{T}(1 -
e^{-Tt})$ for some $M_k > 0$. We will thus have

\[ (z_{k\mu}(0) + M_k/T)\, e^{-Tt} - M_k/T \leq z_{k\mu}(t) \leq M_k/T + (z_{k\mu}(0) - M_k/T)\, e^{-Tt}, \tag{3.3} \]

and for $T > 0$, any $\omega$-limit point of (ZD) will lie in the rectangle $C_T = \{z \in \prod_k \mathbb{R}^{\mathcal{A}_{k,0}} :
|z_{k\mu}| \leq M_k/T\}$. However, the image of $C_T$ under the reduced choice map $Q_0\colon \prod_k \mathbb{R}^{\mathcal{A}_{k,0}} \to
X_0$ will be a compact set that is wholly contained in the interior of $X_0$, thus proving our
assertion for $T > 0$.

On the other hand, for $T < 0$, (3.3) shows that if we pick $z_{k\mu}(0) < -M_k/|T|$, then
we will have $\lim_{t \to \infty} z_{k\mu}(t) = -\infty$ for all $\mu \in \mathcal{A}_{k,0}$, $k \in \mathcal{N}$. Since the set $U_T = \{z \in
\prod_k \mathbb{R}^{\mathcal{A}_{k,0}} : z_{k\mu} < -M_k |T|^{-1}\}$ is a neighborhood of $(-\infty, \dots, -\infty)$ in $\prod_k \mathbb{R}^{\mathcal{A}_{k,0}}$ which is
mapped diffeomorphically by $Q_0$ to the relative interior of a neighborhood of $(0, \dots, 0)$
in $X_0$, it follows that the pure vertex $q = (\alpha_{1,0}, \dots, \alpha_{N,0})$ of $X$ attracts all nearby interior
solutions of (ED). By restriction, this property will hold on any subface of $X$ which contains
$q$, and with the choice of flagged actions $\alpha_{k,0} \in \mathcal{A}_k$ being arbitrary, our proof is complete.

□

This dichotomy in the behavior of the entropic dynamics (ED) for positive and negative
temperatures ties in with the following result which is of independent interest:

Proposition 3.6. Let $h\colon X \to \mathbb{R}$ be a generalized entropy function. Then, there exists a
volume form $\mathrm{Vol}_h$ on $\mathrm{int}(X)$ such that if $U_0 \subseteq X$ is relatively open in $X$ and $\mathrm{cl}(U_0) \cap \mathrm{bd}(X) =
\emptyset$, then:

\[ \mathrm{Vol}_h(U_t) = \mathrm{Vol}_h(U_0) \exp(-A_0 T t), \tag{3.4} \]

where $A_0 = \mathrm{card}(\coprod_k \mathcal{A}_k) - \mathrm{card}(\mathcal{N}) = \sum_k (\mathrm{card}(\mathcal{A}_k) - 1)$, and $U_t = \{x(t) : x(0) \in U_0\}$.
Hence, the entropy-driven game dynamics (ED) are contracting for $T > 0$, expanding for
$T < 0$ and incompressible iff $T = 0$.

Proof. Again, our proof will be based on the dynamics (ZD) for the relative score
variables $z_{k\mu}$, $\mu \in \mathcal{A}_{k,0} = \mathcal{A}_k \setminus \{\alpha_{k,0}\}$ of (2.11). Indeed, if $V_0$ is an open set of $\prod_k \mathbb{R}^{\mathcal{A}_{k,0}}$ and
$W_{k\mu} = \Delta u_{k\mu}(x) - T z_{k\mu}$ denotes the RHS of (ZD), then, by Liouville's theorem, we will have

\[ \frac{d}{dt} \mathrm{Vol}(V_t) = \int_{V_t} \operatorname{div} W \, dV, \tag{3.5} \]

where $dV = \bigwedge_{k,\mu} dz_{k\mu}$ is the ordinary Euclidean volume form of $\prod_k \mathbb{R}^{\mathcal{A}_{k,0}}$, $\mathrm{Vol}$ denotes
the associated (Lebesgue) measure, and $V_t$ is the image of $V_0$ at time $t$ under (ZD), viz.
$V_t = \{z(t) : z(0) \in V_0\}$. However, since $\Delta u_{k\mu}$ does not depend on $z_k$ (because $u_{k\mu}$ and
$u_{k,0}$ themselves do not depend on $x_k$), we will have $\frac{\partial W_{k\mu}}{\partial z_{k\mu}} = -T$. Hence, summing over all
$\mu \in \mathcal{A}_{k,0}$, $k \in \mathcal{N}$, we obtain $\operatorname{div} W = -\sum_k (\mathrm{card}(\mathcal{A}_k) - 1)\, T = -A_0 T$, and by integrating, we
obtain the volume evolution equation $\mathrm{Vol}(V_t) = \mathrm{Vol}(V_0) \exp(-A_0 T t)$.

In view of the above, let $\mathrm{Vol}_h = \mathrm{proj}_0^{*} (Q_0^{-1})^{*} \mathrm{Vol}$ be the push-forward of the Euclidean
volume $\mathrm{Vol}(\cdot)$ to $\mathrm{int}(X)$ via the diffeomorphism $Q_0^{-1} \circ \mathrm{proj}_0\colon \mathrm{int}(X) \to \prod_k \mathbb{R}^{\mathcal{A}_{k,0}}$, i.e.
$\mathrm{Vol}_h(U) = \mathrm{Vol}(Q_0^{-1}(\mathrm{proj}_0(U)))$ for any relatively open set $U \subseteq \mathrm{int}(X)$. Then, taking $V_0$
such that $\mathrm{proj}_0(U_0) = Q_0(V_0)$, our assertion follows from the volume evolution equation
above by recalling that $\mathrm{proj}_0(x(t)) = Q_0(z(t))$. □
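
The key step of the proof, namely that the score field $W$ has constant divergence $-A_0 T$, can be checked by finite differences in a small example. The sketch below is an illustration under stated assumptions: it uses the Gibbs choice map in a 2 × 2 game (so $A_0 = 2$), and the payoff matrices `A`, `B`, the temperature `T`, the evaluation point and the helper names are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def zd_field(z, A, B, T):
    """RHS of (ZD) for a 2x2 game with Gibbs choice; z = (z_1, z_2).

    Action 0 is the flagged benchmark of each player, so player k plays
    action 1 with probability sigmoid(z_k).
    """
    x1 = np.array([1.0 - sigmoid(z[0]), sigmoid(z[0])])
    x2 = np.array([1.0 - sigmoid(z[1]), sigmoid(z[1])])
    u1 = A @ x2                       # payoffs of player 1's pure actions
    u2 = B.T @ x1                     # payoffs of player 2's pure actions
    return np.array([u1[1] - u1[0] - T * z[0],
                     u2[1] - u2[0] - T * z[1]])

def divergence(f, z, eps=1e-5):
    """Central-difference estimate of div f at the point z."""
    total = 0.0
    for i in range(len(z)):
        e = np.zeros(len(z))
        e[i] = eps
        total += (f(z + e)[i] - f(z - e)[i]) / (2.0 * eps)
    return total

A = np.array([[3.0, 0.0], [5.0, 1.0]])   # hypothetical payoffs of player 1
B = np.array([[3.0, 5.0], [0.0, 1.0]])   # hypothetical payoffs of player 2
T = 0.3
z_point = np.array([0.4, -1.1])

div = divergence(lambda z: zd_field(z, A, B, T), z_point)
expected = -2 * T                         # -A_0 T with A_0 = (2-1) + (2-1)
```

The estimate does not depend on the payoffs or on the evaluation point, reflecting the fact that each player's payoff differences are independent of that player's own scores.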

Interestingly, in the special case of the Boltzmann-Gibbs entropy at zero temperature,
Proposition 3.6 yields the classical result that the asymmetric replicator dynamics (RD)
are incompressible and hence do not admit interior attractors (Hofbauer and Sigmund [14],
Ritzberger and Weibull [26]).^16 We thus see that incompressibility characterizes a much
more general class of dynamics and, in our learning context, it is simply a consequence of 
the players assigning a uniform weight to their past observations (neither discounting, nor 
reinforcing them). 

That said, in the case of the replicator dynamics, we have a significantly clearer picture 
for the stability and attraction properties of a game's equilibria thanks to the folk theorem 
of evolutionary game theory (Hofbauer and Sigmund [14]). In particular, it is well known 
that: 

(1) If an interior trajectory converges, its limit is Nash. 

(2) If a state is Lyapunov stable, then it is also Nash. 

(3) A state is asymptotically stable if and only if it is a strict Nash equilibrium.^17

By comparison, in the more general context of the entropy-driven game dynamics (ED), 
we have: 

Theorem 3.7. Let $\mathfrak{G} = \mathfrak{G}(\mathcal{N}, \mathcal{A}, u)$ be a finite game and let $h\colon X \to \mathbb{R}$ be a regular entropy
function with choice map $Q\colon \prod_k \mathbb{R}^{\mathcal{A}_k} \to X$. Then, the entropy-driven dynamics (ED) have
the following properties:

(1) For positive temperatures $T > 0$, if $q \in X$ is Lyapunov stable then it is also a QRE
of $\mathfrak{G}$; moreover, if $q$ is a $Q$-approximate strict Nash equilibrium and $T$ is small
enough, then $q$ is also asymptotically stable.

(2) For $T = 0$, if $q \in X$ is Lyapunov stable, then it is also a Nash equilibrium of $\mathfrak{G}$;
furthermore, $q$ is asymptotically stable if and only if it is a strict Nash equilibrium
of $\mathfrak{G}$.

(3) Finally, for $T < 0$, a profile $q \in X$ will be asymptotically stable if and only if it is
pure (i.e. a vertex of $X$); any other rest point of (ED) is unstable.

Proof. Our proof will be broken up in three parts based on the temperature of the
dynamics (ED):

Positive temperatures. Let $T > 0$ and assume that $q \in X$ is Lyapunov stable (and, hence,
stationary). Clearly, if $q$ is interior, it will be a QRE of $\mathfrak{G}$ by Proposition 3.2 so there is
nothing to show. Suppose therefore that $q \in \mathrm{bd}(X)$; then, by Proposition 3.5, we may pick
a neighborhood $U$ of $q$ in $X$ such that $\mathrm{cl}(U)$ does not contain any $\omega$-limit points of the
interior of $X$ under (ED). However, since $q$ is Lyapunov stable, any interior solution that is
wholly contained in $U$ will have an $\omega$-limit in $\mathrm{cl}(U)$, a contradiction.



^16 This does not hold in the symmetric case: there the proof breaks down because the symmetrized payoff
$u_\alpha(x)$ depends on $x_\alpha$.

^17 Recall that $q \in X$ is said to be Lyapunov stable (or stable) when for every neighborhood $U$ of $q$ in $X$, there
exists a neighborhood $V$ of $q$ in $X$ such that if $x(0) \in V$ then $x(t) \in U$ for all $t \geq 0$. Moreover, $q$ is called attracting
when there exists a neighborhood $U$ of $q$ in $X$ such that $\lim_{t \to \infty} x(t) = q$ if $x(0) \in U$, and $q$ is called asymptotically
stable when it is both stable and attracting.
Regarding asymptotic stability, we will make the simplifying technical assumption that
the entropy function $h$ is decomposable with the same Legendre kernel $\theta$ for all players
(the proof is entirely similar in the general case, but significantly more painful to write
down). Assume further that $q^{*} = (\alpha_{1,0}, \dots, \alpha_{N,0})$ is a strict Nash equilibrium of $\mathfrak{G}$ and
let $q = q(T) \in X$ be a $Q$-approximation of $q^{*}$ with rationality level $\rho = 1/T$. Then, letting
$W_{k\mu} = \Delta u_{k\mu} - T z_{k\mu}$ denote the RHS of the dynamics (ZD), a simple differentiation yields:

\[
\left. \frac{\partial W_{k\mu}}{\partial z_{\ell\nu}} \right|_{q} =
\begin{cases}
-T & \text{if } \ell = k,\ \nu = \mu, \\
0 & \text{if } \ell = k,\ \nu \neq \mu, \\
\sum_\rho h_\ell^{\nu\rho}(q)\, \dfrac{\partial \Delta u_{k\mu}}{\partial w_{\ell\rho}} & \text{otherwise},
\end{cases}
\tag{3.6}
\]

where $h_\ell^{\nu\rho}(q)$ is the inverse Hessian matrix of $h$ evaluated at $q$ and $\frac{\partial}{\partial w_{\ell\rho}} = \frac{\partial}{\partial x_{\ell\rho}} - \frac{\partial}{\partial x_{\ell,0}}$ as
before. In particular, using the inversion formula of Lemma 2.3, we will have

\[ h_\ell^{\nu\rho}(q) = \frac{\delta_{\nu\rho}}{\theta''(q_{\ell\nu})} - \frac{\Theta_h''(q_\ell)}{\theta''(q_{\ell\nu})\, \theta''(q_{\ell\rho})}, \tag{3.7} \]

where $\Theta_h''(q_\ell)$ denotes the harmonic aggregate $\Theta_h''(q_\ell) = \big( \sum_\beta 1/\theta''(q_{\ell\beta}) \big)^{-1}$.

Since $q$ is a $Q$-approximation of the strict equilibrium $q^{*} = (\alpha_{1,0}, \dots, \alpha_{N,0})$, we will also
have $q_{k\mu} = q_{k\mu}(T) \to q^{*}_{k\mu} = 0$ and $q_{k,0} \to q^{*}_{k,0} = 1$ as $T \to 0^{+}$. Moreover, recalling that $q$ is
a QRE of $\mathfrak{G}$ with rationality level $\rho = 1/T$, we will also have $\Delta u_{k\mu}(q) = T \theta'(q_{k\mu}) - T \theta'(q_{k,0})$,
implying in turn that $T \theta'(q_{k\mu}(T)) \to \Delta u_{k\mu}(q^{*}) < 0$ as $T \to 0^{+}$. However, with $h$ regular,
the Legendre kernel of $h$ will satisfy $\theta'(x)/\theta''(x) \to 0$ as $x \to 0^{+}$, whence we obtain

\[
\frac{1}{T \theta''(q_{k\mu}(T))} = \frac{\theta'(q_{k\mu}(T))}{\theta''(q_{k\mu}(T))} \cdot \frac{1}{T \theta'(q_{k\mu}(T))}
\to 0 \cdot \frac{1}{\Delta u_{k\mu}(q^{*})} = 0.
\tag{3.8}
\]

Thus, on account of (3.7) and (3.8), the off-diagonal elements of (3.6) will be $o(T)$ as
$T \to 0^{+}$, so, by continuity, the eigenvalues of the Jacobian of the vector field $W$ evaluated
at $q = q(T)$ will all be negative if $T > 0$ is small enough. As a result, $q$ will be a hyperbolic
rest point of (ZD), so it will also be structurally stable by the Hartman-Grobman theorem,
and hence asymptotically stable as well.

Zero temperature. For T - Q, let q be Lyapunov stable so every neighborhood U of q 'm 
X admits an interior orbit x(f) that stays in U for all f > 0; we then claim that q is Nash. 
Indeed, assume ad abusrdum that o-j- o e supp(^) has u^fliq) < Uf^^iq) for some fi e Ak 
and let U he a neighborhood of q such that x^fi > qk,o/2 and Auk/tix) > m > for all 
X e U. Then, picking an orbit x(t) that is wholly contained in U, the integral equation (3.2) 
gives Zk^(t) > Zk,oiO) + nit, implying in turn that Zkfi(t) +°° as f — > oo. However, with 
Zk/A = ^ — and h regular this is only possible if Xki_,(t) ^ 0, a contradiction. 

Assume now that q - (ak^i, . . . , ak,N) is a strict Nash equilibrium of (5 and let Akfl = 
•A-k \ {akfi] as usual. To show first that q is Lyapunov stable, it will be again convenient to 
work with the relative scores Zkfi and show that if m e R is sufficiently negative, then every 
trajectory z{t) that starts in the open set U„, - {z & Y\k ■ Zkfi < '«) always stays in 
Um', since U,„ is a neighborhood of (-oo, . . . , -oo) in Ylk R'^'", this is easily seen to imply 
Lyapunov stability for q in X. 

In view of the above, pick $m \in \mathbb{R}$ so that $\Delta u_{k\mu}(x(z)) \le -\varepsilon < 0$ for all $z \in U_m$ and let
$\tau_m = \inf\{t : z(t) \notin U_m\}$ be the time it takes $z(t)$ to escape $U_m$. Then, if $\tau_m$ is finite and
$t \le \tau_m$, the integral form (3.2) of the relative score dynamics (ZD) readily yields

$$z_{k\mu}(t) = z_{k\mu}(0) + \int_0^t \Delta u_{k\mu}(Q_0(z(s)))\, ds \le z_{k\mu}(0) - \varepsilon t < m \quad \text{for all } \mu \in \mathcal{A}_k^*,\ k \in \mathcal{N}. \tag{3.9}$$

Thus, substituting $\tau_m$ for $t$ in (3.9), we obtain a contradiction to the definition of $\tau_m$ and we
may conclude that $z(t)$ always stays within $U_m$ if $m$ is chosen negative enough - i.e. $q$ is
Lyapunov stable.

To show that $q$ is in addition attracting, it suffices to let $t \to \infty$ in (3.9) and recall that
$Q_0(z) \to q$ when $z \to (-\infty, \dots, -\infty)$. Finally, for the converse implication, assume that $q$ is
not pure; in particular, assume that $q$ lies in the relative interior of a non-singleton subface
$X'$ spanned by $\mathrm{supp}(q)$. Then, with $h$ regular, Proposition 3.6 shows that $q$ cannot attract a
relatively open neighborhood $U'$ of initial conditions in $X'$ because (ED) remains volume-
preserving when restricted to any subface $X'$ of $X$. In turn, this implies that $q$ cannot be
attracting in $X$ and precludes asymptotic stability, as claimed.

Negative temperatures. For $T < 0$, the fact that every vertex of $X$ is attracting follows
from Proposition 3.5; Lyapunov stability then follows from (3.3) by noting that if $z_{k\mu}(0) <
-M_k |T|^{-1}$ then we will have $z_{k\mu}(t) < z_{k\mu}(0)$ for all $t \ge 0$ (cf. the proof of Proposition 3.5).
Conversely, assume that $q \in X$ is a non-pure Lyapunov stable state. Then, by passing to a
subface of $X$ if necessary, we may assume that $q$ is actually interior. In that case however, if
we take an interior neighborhood $U$ of $q$ in $X$, Proposition 3.6 shows that any neighborhood
$V$ of $q$ that is contained in $U$ will eventually grow to a volume larger than that of $U$ under
(ED), so there is no open set of trajectories contained in $U$. This shows that non-pure rest
points of (ED) cannot be stable and our proof is complete. □

In conjunction with our previous results, Theorem 3.7 provides an interesting insight
into the role of the dynamics' temperature parameter $T$: for small $T > 0$, the dynamics
(ED) are attracted to the interior of $X$ and they only converge to points that are approxi-
mately Nash; for small $T < 0$, the bona fide strict Nash equilibria of the game are indeed
asymptotically stable, but so are all the vertices of $X$ (albeit with very small basins of
attraction); finally, for $T = 0$, the dynamics (ED) are attracted to strict Nash equilibria and
only there (see also Fig. 1). We thus obtain the following rule of thumb: for $T > 0$, the
dynamics (ED) converge to states that are almost Nash, whereas for $T < 0$, the dynamics
converge to Nash states except for a very small fraction of initial conditions.

As such, from the point of view of control and optimization, if one seeks to reach the 
strict Nash equilibria of the game (e.g. as is usually the case when the game is a potential 
one), it would appear that the zero temperature case provides the best convergence proper- 
ties. Nonetheless, there are two important caveats to keep in mind: First, if the dynamics 
(ED) are to be properly implemented as a discrete-time algorithm, then the results of the 
next section show that the positive temperature regime is much more stable - all the while 
allowing players to converge arbitrarily close to a strict equilibrium. On the other hand, if 
one is only interested in the convergence speed of the dynamics (ED), then even arbitrarily 
small negative temperatures yield convergence rates that are exponentially faster than the 
$T = 0$ case:

Proposition 3.8. Let $\mathfrak{G} \equiv \mathfrak{G}(\mathcal{N}, \mathcal{A}, u)$ be a finite game and let $h\colon X \to \mathbb{R}$ be a regular
entropy function with choice map $Q\colon \prod_k \mathbb{R}^{\mathcal{A}_k} \to X$. If $q^* = (\alpha_{1,0}, \dots, \alpha_{N,0})$ is a strict
equilibrium of $\mathfrak{G}$ and $x(t)$ is an interior solution of (ED) which starts sufficiently close to
$q^*$, then, for all $T \le 0$, we will have

$$z_{k\mu}(t) \sim z_{k\mu}(0)\, e^{|T|t} + \Delta u_{k\mu}(q^*)\, \frac{e^{|T|t} - 1}{|T|}, \tag{3.10}$$

where, as before, $z_{k\mu} = \theta'(x_{k\mu}) - \theta'(x_{k,0})$ are the relative scores of the players' equilibrium actions,
$\Delta u_{k\mu} = u_{k\mu} - u_{k,0}$ are the corresponding payoff differences, and we are using the notational
convention $(e^{0t} - 1)/0 = t$.

Put differently, we will have $z_{k\mu}(t) = O(\exp(|T|t))$ for $T < 0$ and $z_{k\mu}(t) = O(t)$ for $T = 0$:
the relative scores $z_{k\mu}$ escape to negative infinity exponentially faster for $T < 0$ than for
$T = 0$.
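
As a rough numerical illustration of the estimate (3.10), the sketch below treats the payoff gap $\Delta u_{k\mu}$ as a constant (an assumption that only holds approximately, close to $q^*$) and compares the closed-form expression with a forward-Euler integration of $\dot z = \Delta u - Tz$. The function names and the numerical values are hypothetical and chosen purely for illustration.

```python
import math

def relative_score(z0, du, T, t):
    """Closed form of dz/dt = du - T*z for T <= 0 and a constant payoff gap du."""
    a = abs(T)
    # Convention of (3.10): (e^{0 t} - 1) / 0 = t.
    growth = t if a == 0 else (math.exp(a * t) - 1.0) / a
    return z0 * math.exp(a * t) + du * growth

def euler_score(z0, du, T, t, steps=100_000):
    """Forward-Euler integration of the same ODE, used as an independent check."""
    z, dt = z0, t / steps
    for _ in range(steps):
        z += dt * (du - T * z)
    return z

z_negative_T = relative_score(-1.0, -0.5, -0.3, 5.0)  # T < 0: exponential escape
z_zero_T = relative_score(-1.0, -0.5, 0.0, 5.0)       # T = 0: linear escape
```

With these illustrative values, the negative-temperature score is more negative than the zero-temperature one at the same time $t$, in line with the discussion above.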

Corollary 3.9. If $q^* = (\alpha_{1,0}, \dots, \alpha_{N,0})$ is a strict Nash equilibrium of $\mathfrak{G}$ and $x(t)$ is an
interior solution of the temperature-adjusted replicator dynamics (T-RD) that starts close
enough to $q^*$, then

$$x_{k,0}(t) \sim \begin{cases} 1 - e^{-O(\exp(|T|t))} & \text{for } T < 0, \\ 1 - e^{-O(t)} & \text{for } T = 0. \end{cases} \tag{3.11}$$

By contrast, $\limsup_{t\to\infty} x_{k,0}(t) < 1$ whenever $T > 0$.

Remark 1. It is not too hard to obtain expressions similar to (3.11) for more general choice 
functions Q, but the end result is not as concise, so we omit it. 

Proof of Proposition 3.8. Pick some $\varepsilon > 0$ and let $x(t)$ start close enough to $q^*$ so
that $(1 + \varepsilon)\Delta u_{k\mu}(q^*) < \Delta u_{k\mu}(x(t)) < (1 - \varepsilon)\Delta u_{k\mu}(q^*)$; that this is possible follows from
the Lyapunov property of strict equilibria established in Theorem 3.7 (recall also that
$\Delta u_{k\mu}(q^*) < 0$ for all $\mu \in \mathcal{A}_k^* \equiv \mathcal{A}_k \setminus \{\alpha_{k,0}\}$, $k \in \mathcal{N}$, because $q^*$ is a strict equilibrium
of $\mathfrak{G}$). We will thus have:

$$(1 + \varepsilon)\Delta u_{k\mu}(q^*) < \dot z_{k\mu} + T z_{k\mu} < (1 - \varepsilon)\Delta u_{k\mu}(q^*), \tag{3.12}$$

so if we multiply by $e^{Tt}$ and integrate, we readily obtain:

$$(1 + \varepsilon)\Delta u_{k\mu}(q^*)\, \frac{e^{Tt} - 1}{T} < e^{Tt} z_{k\mu}(t) - z_{k\mu}(0) < (1 - \varepsilon)\Delta u_{k\mu}(q^*)\, \frac{e^{Tt} - 1}{T}, \tag{3.13}$$

with the convention that $(e^{0t} - 1)/0 = t$. Our assertion then follows by rearranging terms
in (3.13) above and noting that $\varepsilon$ can be taken arbitrarily small since $z_{k\mu}(t) \to -\infty$ for all
$\mu \in \mathcal{A}_k^*$, $k \in \mathcal{N}$. □

Proof of Corollary 3.9. For Boltzmann action selection as in (2.6), we will have
$x_{k,0} = \big(1 + \sum_\mu \exp(z_{k\mu})\big)^{-1}$, so the estimate (3.11) follows from (3.10) and the Taylor expan-
sion $1/(1 + \xi) = 1 - \xi + O(\xi^2)$. □



4. Discrete-time learning algorithms and stochastic approximations 

In this section, we examine how the entropy-driven dynamics (ERL) and (ED) may be 
used to design learning algorithms in the context of finite games that are played repeatedly 
over time. The main challenge in this endeavor is that in practical implementations, players 
can only observe the payoffs that they actually receive when playing the game (or even only 
a noisy version thereof), whereas the dynamics (ERL)/(ED) involve the expected payoffs 
$u_{k\alpha}(x)$. Therefore, in the absence of perfect monitoring (or any other device permitting the
calculation of expected payoffs), any discretization of the dynamics (ERL)/(ED) should 
involve only the players' in-game payoff streams and no other information. 

A natural way of addressing this issue is to take an Euler-like discretization of the dy- 
namics, use the players' evolving mixed strategies to select an action at each stage, and 
update only those components for which payoffs were actually observed. In what follows, 
we will give a brief account of this approach (known as stochastic approximation) and then 
apply it directly to the dynamics (ERL) and (ED). 
4.1. Stochastic approximation of continuous dynamics. For completeness, we recall
here a few general elements from the theory of stochastic approximation following Benaïm
[5] and Borkar [7]. To begin with, let $\mathcal{S}$ be a finite set, and let $Z(n)$, $n \in \mathbb{N}$, be an $\mathbb{R}^{\mathcal{S}}$-valued
stochastic process that satisfies the recursion

$$Z(n + 1) = Z(n) + \gamma_{n+1} U(n + 1), \tag{4.1}$$

where $\gamma_n$ is a sequence of step sizes (usually assumed to vanish with $n$) and $U(n)$ is another
$\mathbb{R}^{\mathcal{S}}$-valued process which is adapted to the filtration $\mathcal{F}$ of $Z$. Then, if there exists a Lipschitz-
continuous vector field $f\colon \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}$ such that $\mathbb{E}[U(n + 1) \mid \mathcal{F}_n] = f(Z(n))$, we will say that
(4.1) is a stochastic approximation of the continuous-time dynamical system

$$\dot z = f(z). \tag{MD}$$

More explicitly, if we split the so-called innovation term $U(n)$ of (4.1) into its average
value $f(Z(n)) = \mathbb{E}[U(n + 1) \mid \mathcal{F}_n]$ and a zero-mean noise term $V(n + 1) = U(n + 1) - f(Z(n))$,
(4.1) may be rewritten as

$$Z(n + 1) = Z(n) + \gamma_{n+1} \big(f(Z(n)) + V(n + 1)\big), \tag{SA}$$

which is just a noisy Euler-like discretization of (MD); conversely, the equation (MD) will
be referred to as the mean dynamics of the stochastic recursion (SA).
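
To make the recursion (SA) concrete, the following minimal sketch runs it in one dimension. The mean field $f(z) = 1 - z$, the bounded uniform noise, the step sizes $\gamma_n = 1/n$, and all function names are illustrative assumptions rather than anything prescribed by the text; the point is only that the noisy iterates settle near the rest point $z = 1$ of the mean dynamics.

```python
import random

def stochastic_approximation(f, z0, n_steps, noise_scale=1.0, seed=0):
    """Scalar version of (SA): Z <- Z + gamma * (f(Z) + V), with gamma_n = 1/n."""
    rng = random.Random(seed)
    z = z0
    for n in range(1, n_steps + 1):
        gamma = 1.0 / n                              # square-summable, not summable
        v = rng.uniform(-noise_scale, noise_scale)   # zero-mean, bounded noise
        z = z + gamma * (f(z) + v)
    return z

# Mean dynamics dz/dt = 1 - z has a unique, globally attracting rest point at z = 1.
z_final = stochastic_approximation(lambda z: 1.0 - z, z0=5.0, n_steps=20_000)
```

For this particular choice of $f$ and $\gamma_n$, the iterate after $n$ steps is simply $1$ plus the running average of the noise samples, which is why it concentrates around the rest point.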

The main goal of the theory of stochastic approximation is to relate the process (SA) 
with the solution trajectories of the mean dynamics (MD). To that end, the standard as- 
sumptions that make this comparison possible are: 

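(A1) The step sequence $\gamma_n$ is $(\ell^2 - \ell^1)$-summable, i.e. $\sum_n \gamma_n^2 < \infty$ but $\sum_n \gamma_n = \infty$ (typically, $\gamma_n = 1/n$).

(A2) $V(n)$ is a martingale difference with $\sup_n \mathbb{E}[\|V(n)\|^2] < \infty$.

(A3) The stochastic process $Z(n)$ is bounded: $\sup_n \|Z(n)\| < \infty$ (a.s.).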

Under these assumptions, the following lemma ensures that Z(n) can only converge to a 
connected set of rest points of the corresponding mean dynamics: 

Lemma 4.1. Assume that the dynamics (MD) admit a strict Lyapunov function (i.e. a real-
valued function which decreases along every non-stationary solution orbit of (MD)), and
assume further that the set of values of this function at the rest points of (MD) has measure
zero in $\mathbb{R}$. Then, under the assumptions (A1)-(A3), every accumulation point of the process
$Z(n)$ generated by the recursion (SA) belongs to a connected set of rest points of the mean
dynamics (MD).

Proof. Our claim is a direct consequence of the following string of results in Benaïm
[5] (listed in order of successive implications): Proposition 4.2, Proposition 4.1, Theorem 5.7,
and Proposition 6.4. □

As an application of the previous lemma, let us consider a game $\mathfrak{G}$ and a stochastic
approximation of the entropy-driven dynamics (ED) that satisfies conditions (A1)-(A3). If
$\mathfrak{G}$ admits a potential function $U$, Lemma 3.3 shows that the free entropy $F = Th - U$
is Lyapunov for the entropy-driven dynamics (ERL)/(ED), and Sard's theorem (Lee [18])
ensures that the set of values taken by $F$ at its critical points has measure zero. Thus, in
view of Proposition 3.2, we see that stochastic approximations of (ERL)/(ED) may only
converge to connected sets of restricted QRE of $\mathfrak{G}$. In what follows, we will exploit this
property in order to derive two entropy-driven learning algorithms based respectively on
the score dynamics (ERL) and the strategy-based dynamics (ED).

Remark 1. We should note here that (SA) implicitly assumes that each component of the
vector $Z$ is updated simultaneously. In a game theoretic setting, this corresponds to com-
plete player synchronization, an assumption which does not always hold; we will address
such issues in Section 4.4.

4.2. Score-based implementation of entropy-driven learning. We begin by construct- 
ing a discrete-time stochastic approximation of the score-based entropic dynamics (ERL) 
as follows: at each step, players play a smoothed best response to the performance scores 
of their actions (using the choice map Q defined in Section 2.3), and then update these 
scores depending on the payoffs they receive from the chosen action. We illustrate this 
process in Algorithm 1 below (presented in a synchronous version, with players selecting 
actions and receiving payoffs simultaneously): 

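Algorithm 1 Score-based learning with entropy-driven action selection.
    $n \leftarrow 0$
    foreach player $k \in \mathcal{N}$ and action $\alpha \in \mathcal{A}_k$ do initialize $Y_{k\alpha}$ and set $X_k \leftarrow Q_k(Y_k)$
    Repeat
        $n \leftarrow n + 1$
        foreach player $k \in \mathcal{N}$ simultaneously do
            select a new action $\hat\alpha_k$ according to the mixed strategy $X_k$    # current action
            $\hat u_k \leftarrow u_k(\hat\alpha)$    # current payoff
            $Y_{k\hat\alpha_k} \leftarrow Y_{k\hat\alpha_k} + \gamma_n (\hat u_k - T Y_{k\hat\alpha_k}) / X_{k\hat\alpha_k}$    # update current action score
            foreach action $\alpha \in \mathcal{A}_k$ do $X_{k\alpha} \leftarrow Q_{k\alpha}(Y_k)$    # update mixed strategy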



To study the convergence properties of Algorithm 1, let $Y(n)$ denote the players' score
profile at the $n$-th iteration of the algorithm - and similarly for $X(n)$ (strategies), $\hat\alpha(n)$
(actions) and $\hat u(n)$ (payoffs). Then, $Y(n)$ is a stochastic process adapted to the filtration $\mathcal{F}_n$
generated by $X$ and satisfies the relation:

$$\mathbb{E}\big[(Y_{k\alpha}(n + 1) - Y_{k\alpha}(n))/\gamma_{n+1} \mid \mathcal{F}_n\big] = \frac{\mathbb{E}\big[\hat u_k(n + 1)\, \mathbb{1}(\hat\alpha_k(n + 1) = \alpha) \mid \mathcal{F}_n\big]}{X_{k\alpha}(n)} - T Y_{k\alpha}(n) = u_{k\alpha}(X(n)) - T Y_{k\alpha}(n), \tag{4.2}$$

for all $\alpha \in \mathcal{A}_k$, $k \in \mathcal{N}$. Together with the selection rule $X_k(n) = Q_k(Y_k(n))$, the RHS of the
above expression yields the entropy-driven score dynamics (ERL), so the strategy process
$X(n)$ generated by Algorithm 1 will be a stochastic approximation of (ERL).
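
The score-based scheme can be sketched in a few lines of code. The version below is a minimal illustration under several assumptions that are not part of the text: the choice map $Q$ is taken to be the Boltzmann (logit) map, the step sequence is $\gamma_n = 1/n$, scores are initialized at zero, and the game is a hypothetical 2-player common-interest coordination game (`coordination_payoff`) with payoffs in $[0, 1]$.

```python
import math
import random

def boltzmann(scores):
    """Logit choice map: probabilities proportional to exp(score), shifted for stability."""
    top = max(scores)
    weights = [math.exp(y - top) for y in scores]
    total = sum(weights)
    return [w / total for w in weights]

def coordination_payoff(k, actions):
    """Hypothetical common-interest game: 1 if both pick action 0, 0.5 if both pick 1."""
    if actions[0] != actions[1]:
        return 0.0
    return 1.0 if actions[0] == 0 else 0.5

def score_based_learning(payoff, n_actions, T=0.5, n_steps=2000, seed=0):
    rng = random.Random(seed)
    players = range(len(n_actions))
    Y = [[0.0] * n_actions[k] for k in players]      # performance scores
    X = [boltzmann(Y[k]) for k in players]           # mixed strategies
    for n in range(1, n_steps + 1):
        gamma = 1.0 / n
        # All players draw their actions simultaneously from their mixed strategies.
        actions = [rng.choices(range(n_actions[k]), weights=X[k])[0] for k in players]
        for k in players:
            a = actions[k]
            u = payoff(k, actions)
            # Only the score of the action actually played is updated.
            Y[k][a] += gamma * (u - T * Y[k][a]) / X[k][a]
        X = [boltzmann(Y[k]) for k in players]
    return X, Y

X, Y = score_based_learning(coordination_payoff, n_actions=[2, 2])
```

Dividing by $X_{k\hat\alpha_k}$ compensates for the fact that each action is only sampled with its own probability, which is what makes the conditional expectation in (4.2) independent of the sampling frequency.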

In the special case where $T = 1$, Algorithm 1 boils down to the Q-learning scheme
of Leslie and Collins [19] which, under the assumption that $Y(n)$ remains bounded (a.s.),
was proven to converge to Nash distributions (the analogue of a QRE with rationality level
$\rho = 1$) in various classes of 2-player games. Unfortunately, the unconditional convergence
of this algorithm still eludes us because assumptions (A2) and (A3) are hard to verify: in
fact, one can check that the order of the noise term $V(n)$ is

$$\mathbb{E}\big[\|V(n + 1)\|^2 \mid \mathcal{F}_n\big] = O\big((1 + \|Y(n)\|^2)\, e^{\|Y(n)\|}\big), \tag{4.3}$$

so (A2) rests on first establishing the boundedness requirement (A3). However, estab-
lishing (A3) is not trivial in itself because of the "almost surely" requirement; it seems to
be possible to do so thanks to an argument by M. Faure (personal communication), but
since Algorithm 1 is not the main focus of our paper, we will not venture further along this
direction.

Note that this is also true for the weaker requirement supplied by Borkar [7], namely that there exist $K$ such
that $\mathbb{E}[\|V(n + 1)\|^2 \mid \mathcal{F}_n] \le K(1 + \|Y_n\|^2)$ for all $n$.
Remark 1. We note here in passing that there exist alternative conditions (replacing As-
sumptions (A1), (A2) and (A3) above) under which a stochastic approximation process can
be shown to track its underlying mean dynamics - see also the relevant discussion in Leslie
and Collins [19]. For instance, if we truncate the stochastic process $V(n)$ with a sequence of
expanding bounds as in Sharia [31], then the $V(n)$ are not required to be a priori bounded;
in that case however, the required summability conditions on $\mathbb{E}[\|V(n + 1)\|^2 \mid \mathcal{F}_n]$ boil down
to showing that $Y(n)$ is itself bounded, so the original difficulty remains. In any case, our
focus here will be on the strategy-based algorithm of the next section where such consid-
erations are not required, so we will not address such issues further.

4.3. Strategy-based implementation of entropy-driven learning. In this section, we 
will focus on the strategy-based variant of the entropic game dynamics (ED), and we will 
implement it as a payoff-based learning procedure in discrete time. One of the advantages 
of this approach is that it does not rely on having a closed form expression for the choice 
map Q (which is hard to obtain for non-Boltzmann action selection); another is that since 
the algorithm is strategy-based (and hence its update variables are bounded by default), 
we will not need to worry too much about satisfying conditions (A2) and (A3) as in the
case of Algorithm 1. As a result, the strategy-based implementation of the entropic game
dynamics (ED) will be significantly easier to handle than its score-based variant.
Without further ado, we have: 



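Algorithm 2 Strategy-based implementation of entropy-driven learning.
    $n \leftarrow 0$
    foreach player $k \in \mathcal{N}$ do initialize $X_k$ as a mixed strategy with full support
    Repeat
        $n \leftarrow n + 1$
        foreach player $k \in \mathcal{N}$ simultaneously do
            select a new action $\hat\alpha_k$ according to the mixed strategy $X_k$    # current action
        foreach player $k \in \mathcal{N}$ do
            $\hat u_k \leftarrow u_k(\hat\alpha_1, \dots, \hat\alpha_N)$    # current payoff
            foreach action $\alpha \in \mathcal{A}_k$ do    # update mixed strategy
                $X_{k\alpha} \leftarrow X_{k\alpha} + \frac{\gamma_n}{\theta''(X_{k\alpha})} \Big[ \frac{\hat u_k}{X_{k\hat\alpha_k}} \Big( \mathbb{1}(\hat\alpha_k = \alpha) - \frac{\Theta''_h(X_k)}{\theta''(X_{k\hat\alpha_k})} \Big) - T g_\alpha(X_k) \Big]$

    where $g_\alpha(X_k) = \theta'(X_{k\alpha}) - \Theta''_h(X_k) \sum_\beta \theta'(X_{k\beta})/\theta''(X_{k\beta})$ is the entropy adjustment
    term of (ED).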



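As stated above, the update step of Algorithm 2 has been designed so as to track the
entropic dynamics (ED). Indeed, letting $X(n)$ (resp. $\hat\alpha(n)$, $\hat u(n)$) denote the players' strategy
(resp. action, payoff) profile at the $n$-th iteration of the algorithm, we will have for all
$\alpha \in \mathcal{A}_k$, $k \in \mathcal{N}$:

$$\mathbb{E}\big[(X_{k\alpha}(n + 1) - X_{k\alpha}(n))/\gamma_{n+1} \mid \mathcal{F}_n\big] = \frac{1}{\theta''(X_{k\alpha}(n))} \Big[ u_{k\alpha}(X(n)) - \Theta''_h(X_k(n)) \sum_\beta \frac{u_{k\beta}(X(n))}{\theta''(X_{k\beta}(n))} - T g_\alpha(X_k(n)) \Big], \tag{4.4}$$

which is simply the RHS of the entropy-driven dynamics (ED) evaluated at $X(n)$.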
Remark 1. In the case of the Boltzmann-Gibbs kernel $\theta(x) = x \log x$, the update step of
Algorithm 2 becomes

$$X_{k\alpha} \leftarrow X_{k\alpha} + \gamma_{n+1} \Big[ \big(\mathbb{1}(\hat\alpha_k = \alpha) - X_{k\alpha}\big) \cdot \hat u_k - T X_{k\alpha} \Big( \log X_{k\alpha} - \sum_\beta X_{k\beta} \log X_{k\beta} \Big) \Big]. \tag{4.5}$$

For zero temperatures, we thus obtain the reinforcement learning scheme of Sastry et al.
[30] that was based on the classical replicator equation (RD).
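
A minimal sketch of the Boltzmann-Gibbs update (4.5) is given below. Everything beyond the update formula itself is an assumption made for illustration: the step sequence $\gamma_n = 1/(5 + n)$, the uniform initialization, and the hypothetical 2-player common-interest game `coordination_payoff` with payoffs normalized to $[0, 1]$.

```python
import math
import random

def coordination_payoff(k, actions):
    """Hypothetical common-interest game: 1 if both pick action 0, 0.5 if both pick 1."""
    if actions[0] != actions[1]:
        return 0.0
    return 1.0 if actions[0] == 0 else 0.5

def strategy_based_learning(payoff, n_actions, T=0.2, n_steps=5000, seed=0):
    rng = random.Random(seed)
    players = range(len(n_actions))
    # Full-support initialization (uniform mixed strategies).
    X = [[1.0 / n_actions[k]] * n_actions[k] for k in players]
    for n in range(1, n_steps + 1):
        gamma = 1.0 / (5.0 + n)  # assumed step sequence, small enough to stay in the simplex
        actions = [rng.choices(range(n_actions[k]), weights=X[k])[0] for k in players]
        for k in players:
            u = payoff(k, actions)
            mean_log = sum(x * math.log(x) for x in X[k])
            # Update (4.5): replicator-like term plus entropic adjustment.
            X[k] = [
                x + gamma * (((1.0 if a == actions[k] else 0.0) - x) * u
                             - T * x * (math.log(x) - mean_log))
                for a, x in enumerate(X[k])
            ]
    return X

X = strategy_based_learning(coordination_payoff, n_actions=[2, 2])
```

Both bracketed terms sum to zero over the actions of a player, so each $X_k$ stays normalized; with $T > 0$ and steps this small, the entropic term also keeps every component strictly positive, so `math.log` is never evaluated at zero.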

On the other hand, unlike Algorithm 1 (which was evolving in $\mathbb{R}^{\mathcal{A}}$), Algorithm 2 is well-
defined only if the iterates $X$ are admissible mixed strategies at each update step. To check
that this is indeed the case, note first that the second term of the update step of Algorithm 2
vanishes when summed over $\alpha \in \mathcal{A}_k$, so $\sum_\alpha X_{k\alpha}(n)$ remains constant and equal to 1 (recall
that $X_k(0)$ is initialized as a valid probability distribution). It thus suffices to check that
$X_{k\alpha}(n) \ge 0$ for all $\alpha \in \mathcal{A}_k$; restricting ourselves to positive learning temperatures $T > 0$ and
normalizing the game's payoffs to $[0, 1]$ for simplicity, we have:

Lemma 4.2. Let $\theta$ be a regular entropy kernel such that $x\theta''(x) \ge m$ for some $m > 0$ and
for all $x \in (0, 1)$. Then, for $T > 0$ and normalized payoffs $u_k\colon \mathcal{A} \to [0, 1]$, there exists a
positive constant $K > 0$ (which only depends on $T$ and $\theta$) such that if the step sequence
$\gamma_n$ is bounded from above by $1/K$, then $X_{k\alpha}(n) > 0$ for all $\alpha \in \mathcal{A}_k$, $k \in \mathcal{N}$, and for all $n \ge 0$.

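Proof. For notational simplicity, we will only consider here the single-player case,
the general case being similar. To that end, we first claim that the entropic term $g_\alpha$ of the
update step of Algorithm 2 is bounded by a constant $C_\theta$; indeed:

• $\theta'$ is increasing, hence $\theta'(\xi) \le \theta'(1)$ for all $\xi \in (0, 1)$.

• For all $x \in X \equiv \Delta(\mathcal{A})$, $1/\Theta''_h(x) \ge \max_\beta 1/\theta''(x_\beta)$, so $\Theta''_h(x) \le \min_\beta \theta''(x_\beta) \le
\max\{\theta''(\xi) : \mathrm{card}(\mathcal{A})^{-1} \le \xi \le 1\}$. Thus there exists $\Theta''_{h,\max}$ such that $\Theta''_{h,\max} \ge
\Theta''_h(x) > 0$ for all $x \in X$.

• By the regularity assumption for $\theta$, we will have $\theta'(\xi)/\theta''(\xi) \to 0$ as $\xi \to 0^+$, so
there exists $M > 0$ such that $|\theta'(\xi)/\theta''(\xi)| \le M$ for all $\xi \in (0, 1)$.

As a result, our claim follows by taking $C_\theta = \theta'(1) + \Theta''_{h,\max}\, \mathrm{card}(\mathcal{A})\, M$.

Now, letting $\hat\alpha$ be the chosen action at step $n$ and $\hat u$ the corresponding payoff, and writing
$x \equiv X(n)$, we will have:

$$-\frac{X_\alpha(n + 1) - X_\alpha(n)}{\gamma_{n+1} X_\alpha(n)} = \frac{1}{x_\alpha \theta''(x_\alpha)} \Big[ T g_\alpha(x) - \frac{\hat u}{x_{\hat\alpha}} \Big( \mathbb{1}(\hat\alpha = \alpha) - \frac{\Theta''_h(x)}{\theta''(x_{\hat\alpha})} \Big) \Big]$$
$$\le \frac{1}{x_\alpha \theta''(x_\alpha)} \Big[ T C_\theta + \frac{\hat u}{x_{\hat\alpha} \theta''(x_{\hat\alpha})} \big( \Theta''_h(x) - \mathbb{1}(\hat\alpha = \alpha)\, \theta''(x_{\hat\alpha}) \big) \Big]$$
$$\le \frac{1}{x_\alpha \theta''(x_\alpha)} \Big[ T C_\theta + \Theta''_{h,\max} \cdot \hat u\, \big(x_{\hat\alpha} \theta''(x_{\hat\alpha})\big)^{-1} \Big] \le m^{-1} \big( T C_\theta + m^{-1} \Theta''_{h,\max} \big), \tag{4.6}$$

where we used the assumption $T > 0$ for the first inequality and the normalization $\hat u \in [0, 1]$
for the second and third one. Hence, with $K \equiv m^{-1} (T C_\theta + m^{-1} \Theta''_{h,\max})$, if $\gamma_n < 1/K$ we will have
$X_\alpha(n + 1) \ge X_\alpha(n)(1 - \gamma_{n+1} K) > 0$ whenever $X_\alpha(n) > 0$, and the induction is complete. □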

Under the assumptions of Lemma 4.2 above, Algorithm 2 is well-defined and runs no risk
of breaking down; we now show that if $\mathfrak{G}$ is a potential game, then Algorithm 2 converges to a
QRE of $\mathfrak{G}$ almost surely:

Theorem 4.3. Let $\mathfrak{G}$ be a potential game, and let $\theta$ be a regular entropy kernel such that
$x\theta''(x) \ge m$ for some $m > 0$ and for all $x \in (0, 1)$. Then, for $T > 0$ and sufficiently small
step sizes $\gamma_n$ satisfying (A1), Algorithm 2 converges almost surely to a connected set of
QRE of $\mathfrak{G}$ with rationality level $\rho = 1/T$.

Proof. By the proof of Lemma 4.2, assumptions (A2) and (A3) for the iterates $X(n)$
of Algorithm 2 are verified immediately - simply note that the innovation term of the
update step of Algorithm 2 is bounded by the constant $K$ of Lemma 4.2. Thus, by Lemma
4.1 and the subsequent discussion, it follows that $X(n)$ converges to a connected set of
restricted QRE of $\mathfrak{G}$.

We will now show that the accumulation points of $X(n)$ can only lie in the relative
interior $\mathrm{rel\,int}(X)$ of $X$. To that end, let $z_{k\mu} = \theta'(x_{k\mu}) - \theta'(x_{k,0})$ be the reduced score variables
of (2.11) and let $W_{k\mu} = \Delta u_{k\mu} - T z_{k\mu}$ denote the RHS of the reduced score dynamics (ZD),
viz. $\dot z = W(z)$. Then, following Borkar and Meyn [8], the limiting ODE of the dynamics
(ZD) will be:

$$\dot z = W_\infty(z) = \lim_{r\to\infty} W(rz)/r = \lim_{r\to\infty} \frac{\Delta u_{k\mu}(Q_0(rz)) - Trz}{r} = -Tz, \tag{ZD$_\infty$}$$

where $Q_0\colon z \mapsto x$ is the reduced choice map of (2.13). For $T > 0$, the origin is a global
attractor of (ZD$_\infty$), so Theorem 2.1 in Borkar and Meyn [8] implies that the discrete sto-
chastic approximation $Z_{k\mu}(n) = \theta'(X_{k\mu}(n)) - \theta'(X_{k,0}(n))$ of (ZD) will be bounded almost
surely. Moreover, given that $Q_0$ is a homeomorphism onto $\mathrm{rel\,int}(X)$, the image of any
compact subset of $\prod_k \mathbb{R}^{\mathcal{A}_k^*}$ will be a compact subset of $\mathrm{rel\,int}(X)$, so any accumulation
point of the process $X = Q_0(Z)$ will lie in $\mathrm{rel\,int}(X)$, as claimed. Since $X(n)$ was shown to
converge to a connected set of restricted QRE of $\mathfrak{G}$, our assertion follows. □

Remark 1. It is important to note here that Theorem 4.3 holds for any $T > 0$, so Algo-
rithm 2 may be tuned to converge to QRE with arbitrarily high rationality level $\rho = 1/T$
- and hence, arbitrarily close to the game's strict Nash equilibria; cf. the discussion fol-
lowing Theorem 3.7 in Section 3. In this way, Theorem 4.3 is different in scope than the
convergence results of Cominetti et al. [10] and Bravo [9]: instead of taking high learning
temperatures to guarantee a unique QRE, players who employ low learning temperatures
may converge arbitrarily close to the game's strict equilibria.

Remark 2. In view of the above, one might hope that Algorithm 2 would still converge
to the game's (strict) Nash equilibria even for $T = 0$. In that case however, the limiting
ODE (ZD$_\infty$) no longer admits a global attractor at the origin, so the relative scores $z_{k\mu}$
may fail to remain bounded and we cannot discount the convergence of Algorithm 2 to
non-Nash vertices of $X$. In fact, even in the simplest possible case of a single player game
with two actions, Lamberton et al. [16] showed that the $T = 0$ version of Algorithm 2 with
Boltzmann action selection fails to converge a.s. to a Nash equilibrium for step sequences
of the form $\gamma_n = 1/n^r$, $0 < r < 1$.



4.4. Robustness of the strategy-based learning algorithm. Even though Algorithm 2 
only requires players to observe and record their in-game payoffs, it still relies on the 
following assumptions: 

(1) Players have perfect observations of their payoffs. 

(2) Players all update their strategies at the same time. 

(3) There is no delay between playing and receiving payoffs. 

Since these assumptions are often violated in practical scenarios, we devote the rest of this 
section to examining the robustness of Algorithm 2 in this more general setting. 
[Figure 2, panels (c) and (d): distribution after $n = 5$ iterations and after $n = 10$ iterations.]

Figure 2. Snapshots of the evolution of Algorithm 2. In our simulations, we drew
$10^4$ random initial strategies in the potential game of Fig. 1 and, for each strategy allocation,
we ran the Boltzmann variant of Algorithm 2 with learning temperature $T = 0.2$ and
step sequence y„ = 1/(5 + n''^). In each figure, the shades of gray represent the normalized 
density of states at each point of the game's strategy space and we also drew the phase 
portraits of the underlying mean dynamics (ED) for convenience. We see that Algorithm 
2 converges to the game's QRE (which, for $T = 1/\rho = 0.2$, are very close to the game's
strict equilibria) quite rapidly: after only $n = 10$ iterations, more than 99% of the initial
strategies have converged within $\varepsilon = 0.01$ of the game's equilibria.

Noisy measurements and stochastic perturbations. In many real-world applications of game 
theory (and especially in traffic congestion games), the payoffs received by the players at 
each stage of the game may be subject to random shocks (Hofbauer and Sandholm [13], 
Mertikopoulos and Moustakas [24]) or players may only be able to get a rough measure- 
ment of their true payoffs (see e.g. Conitzer and Sandholm [11] for an example drawn from 
revenue sharing and mechanism design). 
We will model such stochastic perturbations by considering a random noise term $\xi_k(n)$,
$k \in \mathcal{N}$, which is added to the payoff $\hat u_k(n)$ received by player $k$ at the $n$-th iteration of
Algorithm 2 (and which, in principle, might depend on the players' strategy or action
profiles). Then, with notation as before, we have:

Proposition 4.4. Let $\xi(n)$ be an $\mathcal{F}_n$-adapted martingale difference with values in $\mathbb{R}^{\mathcal{N}}$ (i.e.
$\mathbb{E}[\xi(n + 1) \mid \mathcal{F}_n] = 0$); assume further that $\xi_k$ is bounded for all $k \in \mathcal{N}$ (a.s.) and that
$\xi_k$ is stochastically independent of the chosen action $\hat\alpha_k$ of player $k$. Then, the conclusion
of Theorem 4.3 still holds when the payoff stream $\hat u_k$ in Algorithm 2 is replaced by the
perturbed process $\tilde u_k = \hat u_k + \xi_k$.

Proof. Since the noise process $\xi$ is bounded (a.s.), we can still assume that the
perturbed payoff functions are normalized in $[0, 1]$; hence, by taking a sufficiently small
step sequence, Algorithm 2 remains well-defined (i.e. $X(n) \in X$ for all $n$). It thus suffices to
check that the conditional expectation of the innovation term of (4.3) with $\hat u_k$ replaced by
$\tilde u_k = \hat u_k + \xi_k$ still yields the entropy-driven dynamics (ED). That this is so is an immediate
consequence of the independence between $\xi_k$ and $\hat\alpha_k$; indeed:

$$\mathbb{E}\big[\tilde u_k(n + 1)\, \mathbb{1}(\hat\alpha_k(n + 1) = \alpha) \mid \mathcal{F}_n\big]$$
$$= \mathbb{E}\big[(\hat u_k(n + 1) + \xi_k(n + 1))\, \mathbb{1}(\hat\alpha_k(n + 1) = \alpha) \mid \mathcal{F}_n\big]$$
$$= u_{k\alpha}(X(n)) + \mathbb{E}[\xi_k(n + 1) \mid \mathcal{F}_n]\, X_{k\alpha}(n) = u_{k\alpha}(X(n)), \tag{4.7}$$

where the second equality follows from the independence of $\xi$ and $\hat\alpha$, and the last one stems
from the fact that $\xi(n)$ is an $\mathcal{F}_n$-adapted martingale difference. The entropic dynamics (ED)
are then obtained as in (4.4). □

Remark 1. We should note here that the assumptions on the noise term $\xi(n)$ of Proposi-
tion 4.4 are rather mild: they encompass not only the case where the perturbations are
independent of the players' choices (a case which has attracted significant interest in the
literature by itself), but also scenarios where $\xi(n + 1)$ might depend on the entire history of
the game up to stage $n$, or even when the noise stems from imperfect observations of the
other players' current actions (i.e. at the $(n + 1)$-th stage of the game).

Asynchronous updates and delays. In Algorithm 2, it is assumed that players update their 
strategies at every iteration simultaneously, i.e. they adhere to a synchronous strategy re- 
vision process. On the other hand, in congestion games and applications (e.g. in wireless 
networks), it is often the case that revisions are asynchronous: for instance, if we con- 
sider a set of wireless users communicating with a slotted ALOHA base station [1], then 
each user's decision to transmit or remain silent at a given timeslot is not coordinated with 
other users, so updates and strategy revisions occur at different periods for each user. Fur- 
thermore, in the same scenario, variable message propagation delays often mean that the 
outcome of a user's choice does not depend on the choices of other users at the current 
timeslot, but on their choices in previous timeslots. 

In view of the above, the first extension to Algorithm 2 that we will consider here
is the case where only a random subset of players (possibly of cardinality 1) revises their
strategies at a given iteration of the algorithm. To that end, let $R_n \in 2^{\mathcal{N}}$ be the random set of
players who update their strategies at the $n$-th iteration of the algorithm. In practice, players
are not aware of the global iteration counter $n$ but only know the number of updates that
they have carried out up to time $n$, as measured by the random variables $\phi_k(n) = \mathrm{card}\{m \le
n : k \in R_m\}$, $k \in \mathcal{N}$. Accordingly, the asynchronous variant of Algorithm 2 that we will
consider consists of replacing the instruction "for each player $k \in \mathcal{N}$" by "for each player
$k \in R_n$" and replacing "$n$" by "$\phi_k(n)$" in the step-size computation.
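
The asynchronous variant can be sketched as follows for the Boltzmann-Gibbs kernel. Several modelling choices here are assumptions made only for illustration: each player joins the revision set independently with a fixed probability `rates[k]` (an i.i.d. revision process, which is a special case of a homogeneous ergodic Markov chain), every player still draws an action at every round, step sizes are $K/\phi_k(n)$ with a hypothetical constant $K = 0.15$, payoff delays are not modelled, and the game is the same hypothetical coordination game with payoffs in $[0, 1]$.

```python
import math
import random

def coordination_payoff(k, actions):
    """Hypothetical common-interest game: 1 if both pick action 0, 0.5 if both pick 1."""
    if actions[0] != actions[1]:
        return 0.0
    return 1.0 if actions[0] == 0 else 0.5

def async_strategy_learning(payoff, n_actions, rates, T=0.2, K=0.15,
                            n_steps=5000, seed=0):
    rng = random.Random(seed)
    players = range(len(n_actions))
    X = [[1.0 / n_actions[k]] * n_actions[k] for k in players]
    updates = [0] * len(n_actions)   # phi_k(n): number of revisions made by player k
    for _ in range(n_steps):
        actions = [rng.choices(range(n_actions[k]), weights=X[k])[0] for k in players]
        # Random revision set R_n: player k revises with probability rates[k].
        revising = [k for k in players if rng.random() < rates[k]]
        for k in revising:
            updates[k] += 1
            gamma = K / updates[k]   # step size driven by the player's own counter
            u = payoff(k, actions)
            mean_log = sum(x * math.log(x) for x in X[k])
            X[k] = [
                x + gamma * (((1.0 if a == actions[k] else 0.0) - x) * u
                             - T * x * (math.log(x) - mean_log))
                for a, x in enumerate(X[k])
            ]
    return X, updates

X, updates = async_strategy_learning(coordination_payoff, [2, 2], rates=[1.0, 0.3])
```

Each player only needs its own revision counter, so no global clock is shared between players in this sketch.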

Furthermore, as noted above, another natural extension of Algorithm 2 consists of al-
lowing the (possibly perturbed) payoffs perceived by the players to be subject to delays.
Formally, let $\tau_{k\alpha}(n)$ be the (integer-valued) delay that player $k$ experiences when playing
his $\alpha$-th action at step $n$. Then, the payoff $\hat u_k$ of player $k$ in (4.3) at step $n$ should be replaced
by $u_k(\hat\alpha_k(n); \hat\alpha_{-k}(n - \tau_{k\alpha}(n)))$, with expected value conditioned on the history $\mathcal{F}_n$ given by
$u_{k,\hat\alpha_k(n)}(X(n - \tau_{k\alpha}(n)))$.

Following Chapter 7 of Borkar [7], we will make the following assumptions regarding 
these two extensions of Algorithm 2: 

(1) The step sequence is of the form $\gamma_n = K/n$, where $K$ is a positive constant small
enough to guarantee that Algorithm 2 remains well-defined for all $n$.

(2) The strategy revision process $R_n$ is a homogeneous ergodic Markov chain over $2^{\mathcal{N}}$;
in particular, if $\mu$ is its (necessarily unique) stationary distribution, then the asymp-
totic update rate of player $k$ will be $\eta_k = \sum_{A \subseteq \mathcal{N}} \mu(A)\, \mathbb{1}(k \in A) = \sum_{A \subseteq \mathcal{N} : k \in A} \mu(A)$.

(3) The delay processes $\tau_{k\alpha}(n)$ are bounded, $0 \le \tau_{k\alpha}(n) \le M$ (a.s.) for some $M > 0$
and for all $n \in \mathbb{N}$. This condition ensures that the delay bias becomes negligible as
time steps are aggregated.

These hypotheses are rather mild in themselves, but they can be weakened even further at
the expense of presentational simplicity and clarity (see e.g. Chapter 7 of Borkar [7]). Still
and all, we have:

Proposition 4.5. Under the previous assumptions, the conclusion of Theorem 4.3 still 
holds for the variant of Algorithm 2 with asynchronous updates and payoff delays. 

Proof. By Theorems 2 and 3 in Chapter 7 of Borkar [7], Algorithm 2 modified to
account for asynchronous strategy revisions and payoff delays as above will be a stochastic
approximation of the rate-adjusted dynamics

$$\dot x_k = \eta_k\, \mathrm{ED}_k(x), \tag{4.8}$$

where $\eta_k$ is the mean rate at which player $k$ updates its strategy, and $\mathrm{ED}_k$ denotes the RHS
of the entropy-driven dynamics (ED). In general, the revision rate $\eta_k$ will depend on time
(leading to a non-autonomous dynamical system), but given that the revision process $R_n$
is a homogeneous ergodic Markov chain, $\eta_k$ will be equal to the (constant) probability
of including player $k$ in the revision set $R_n$ at the $n$-th iteration of the algorithm. These
dynamics have the same rest points as (ED) and an easy calculation shows that the free
entropy $F(x) = Th(x) - U(x)$ of Lemma 3.3 remains a strict Lyapunov function for (4.8), so the
proof of Theorem 4.3 goes through unchanged. □

Remark 1. It is important to note here that the entropic dynamics (4.8) adjusted for different
strategy revision rates are equivalent to the choice-adjusted dynamics (ED$_\eta$) which corre-
spond to players using a different inverse choice temperature (hence the identical notation).
Therefore, if the players' revision process is a homogeneous ergodic Markov chain, their
mean revision rates $\eta_k \in (0, 1)$ may also be viewed as inverse choice temperatures of play-
ers who never miss a revision opportunity, but who tone down their actions' performance
scores by playing the mixed strategy $X_k = Q(\eta_k y_k)$ instead of $Q(y_k)$.

Note also that $\eta_k < 1$, so players who do not update all the time tend to choose actions in a more uniform
manner.
4.5. Practical implementation aspects of Algorithm 2. We conclude this section by
highlighting some implementation features of Algorithm 2:

(1) First, Algorithm 2 is highly distributed. The information needed to update each 
player's mixed strategy is the payoff of each player's chosen action, so there is no 
need to be able to assess the performance of alternate strategic choices (including 
monitoring other players' actions, or even their existence). Additionally, there 
is no need for player updates to be synchronized: as shown in Section 4.4, each 
player can update his strategies independently of others. 

(2) Algorithm 2 is also robust to random payoff disturbances (or, equivalently, to noisy 
measurements thereof) in potential games. In fact, this stochasticity could result 
from old actions of remote players and it can also stem from faulty observations
(such as SINR measurements in wireless networks). At any rate, if these disturbances
are not systematically biased, Algorithm 2 guarantees convergence to the
game's set of QRE at arbitrarily high rationality levels (i.e. arbitrarily close to a 
Nash equilibrium). 

(3) The temperature parameter T should be taken positive in order to guarantee con- 
vergence in potential games. Smaller values yield convergence to QRE that are 
very close to the game's Nash equilibria; on the other hand, such a choice also 
impacts convergence speed because the step sizes have to be taken commensu- 
rately small (for instance, note that the step-size bound of Lemma 4.2 is roughly 
proportional to the learning temperature of the dynamics). As such, tuning the 
temperature T will usually require some problem-dependent rules of thumb; re- 
gardless, our numerical simulations suggest that Algorithm 2 converges within a 
few iterations even for small temperature values (cf. Fig. 2). 
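
The role of the temperature in point (3) can be illustrated with a small fixed-point computation. The sketch below assumes a logit choice model and a hypothetical symmetric 2x2 common-payoff (hence potential) game in which both players earn 2 for coordinating on action A, 1 for coordinating on B, and 0 otherwise; the function name `logit_qre`, the starting point and the iteration count are likewise assumptions made for illustration.

```python
import math

def logit_qre(T, x0=0.9, iters=200):
    """Iterate the symmetric logit response map x -> sigmoid((u_A - u_B) / T).

    x is the probability that each player assigns to action A.
    """
    x = x0
    for _ in range(iters):
        u_a = 2.0 * x    # expected payoff of A when the opponent plays A w.p. x
        u_b = 1.0 - x    # expected payoff of B against the same opponent
        x = 1.0 / (1.0 + math.exp(-(u_a - u_b) / T))
    return x

x_warm = logit_qre(T=1.0)  # high temperature: interior QRE
x_cold = logit_qre(T=0.1)  # low temperature: QRE near the pure equilibrium (A, A)
print(x_warm, x_cold)
```

In this toy example the low-temperature fixed point sits much closer to the pure Nash equilibrium (A, A) than the high-temperature one, in line with the trade-off described above.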

In view of the above, we may derive a player-centered variant of Algorithm 2 as follows:
first, assume that each player $k \in \mathcal{N}$ is equipped with a discrete event timer $\tau_k(n)$, $n \in \mathbb{N}$,
representing the times at which player $k$ wishes to update his strategies, and such that
$n/\tau_k(n) \geq c > 0$ for all $n \in \mathbb{N}$ (so that player $k$ keeps updating at a positive rate). Then, if $t$
denotes a global counter that runs through the set of update times $\mathcal{T} = \bigcup_k \{\tau_k(n) : n \in \mathbb{N}\}$,
the corresponding revision set at time $t \in \mathcal{T}$ will be $R_t = \{k : \tau_k(n) = t \text{ for some } n \in \mathbb{N}\}$,
leading to the decentralized implementation below:



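Algorithm 3 Decentralized strategy-based implementation of entropy-driven learning.

$n \leftarrow 0$;
Repeat
    Event UpdateStrategies occurs at time $\tau_k(n+1) \in \mathcal{T}$;
    $n \leftarrow n + 1$;
    select a new action $\hat\alpha_k$ according to the mixed strategy $X_k$;    # current action
    receive payoff $\hat u_k$ (incl. stochastic disturbances and/or delays);    # observe current payoff
    foreach action $\alpha \in \mathcal{A}_k$ do

$$X_{k\alpha} \leftarrow X_{k\alpha} + \frac{\gamma_n}{\theta''(X_{k\alpha})} \left[ \frac{\hat u_k}{X_{k\hat\alpha_k}} \left( \mathbb{1}(\hat\alpha_k = \alpha) - \frac{\Theta_h''(X_k)}{\theta''(X_{k\hat\alpha_k})} \right) - T g_{k\alpha}(X) \right]$$

where $g_{k\alpha}(X) = \theta'(X_{k\alpha}) - \Theta_h''(X_k) \sum_\beta \theta'(X_{k\beta})/\theta''(X_{k\beta})$ is the entropy adjustment term
of (ED).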
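
As a rough illustration of Algorithm 3, the sketch below specializes the update to the Gibbs entropy $\theta(x) = x \log x$, for which $1/\theta''(x) = x$ and $\Theta_h'' = 1$ on the simplex. It assumes that the update consists of a payoff term minus $T$ times the entropy adjustment term. The payoff matrix, the periodic timers $\tau_k(n) = p_k n$, the step sizes, the noise level, the initial actions, and the convention that non-revising players keep their last action are all hypothetical choices, not part of the original algorithm.

```python
import math
import random

def gibbs_update(x, a_hat, u_hat, gamma, T):
    """One pass of the 'foreach action' loop for theta(x) = x log x.

    Update: x_a += gamma * x_a * [u_hat * (1{a == a_hat} / x[a_hat] - 1) - T * g_a],
    with g_a = log x_a - sum_b x_b log x_b.
    """
    mean_log = sum(p * math.log(p) for p in x)
    new_x = []
    for a, p in enumerate(x):
        payoff_term = u_hat * ((1.0 if a == a_hat else 0.0) / x[a_hat] - 1.0)
        entropy_term = math.log(p) - mean_log
        new_x.append(p + gamma * p * (payoff_term - T * entropy_term))
    return new_x

def run(steps=2000, T=0.1, seed=0):
    rng = random.Random(seed)
    payoff = [[1.0, 0.2], [0.2, 0.6]]  # hypothetical common-payoff (potential) game
    periods = [1, 2]                   # assumed timers: tau_k(n) = periods[k] * n
    x = [[0.5, 0.5], [0.5, 0.5]]       # mixed strategies X_k
    actions = [0, 0]                   # assumed initial actions
    counts = [0, 0]                    # per-player update counters n
    for t in range(1, steps + 1):
        # Revision set R_t: players whose timer fires at global time t.
        revision_set = [k for k in range(2) if t % periods[k] == 0]
        for k in revision_set:
            actions[k] = rng.choices(range(len(x[k])), weights=x[k])[0]
        for k in revision_set:
            counts[k] += 1
            # Observed payoff with zero-mean (unbiased) noise.
            u_hat = payoff[actions[0]][actions[1]] + rng.uniform(-0.1, 0.1)
            gamma = 0.5 / (counts[k] + 10)  # assumed vanishing step size
            x[k] = gibbs_update(x[k], actions[k], u_hat, gamma, T)
    return x

if __name__ == "__main__":
    print(run())
```

The payoff and entropy terms each sum to zero over a player's actions, so the update keeps every $X_k$ on the simplex; the small step sizes and strictly positive payoffs chosen here keep the iterates in its interior.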